You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2016/07/26 02:21:23 UTC

[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/14359

    [SPARK-16719][ML] Random Forests should communicate fewer trees on each iteration

    ## What changes were proposed in this pull request?
    
    RandomForest currently sends the entire forest to each worker on each iteration. This is because (a) the node queue is FIFO and (b) the closure references the entire array of trees (topNodes). (a) causes RFs to handle splits in many trees, especially early on in learning. (b) sends all trees explicitly.
    
    This PR:
    (a) Change the RF node queue to be FILO, so that RFs tend to focus on 1 or a few trees before focusing on others.
    (b) Change topNodes to pass only the trees required on that iteration.
    
    ## How was this patch tested?
    
    Unit tests:
    * Existing tests for correctness of tree learning
    * New test: NodeQueue should be FILO
    * Manually modifying code and running tests to verify that a small number of trees are communicated on each iteration
      * This last item is hard to test via unit tests given the current APIs.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark rfs-fewer-trees

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14359.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14359
    
----
commit c76ec826cac1f2c43d10bf18ca5a914a572fa7b7
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2016-04-21T18:01:28Z

    changed RandomForest node queue to be FILO and changed topNodes to pass only trees in each group

commit 4547a44796420ea8504e9840789893b279eae0e5
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2016-04-27T03:42:16Z

    Added test for FILO nodeQueue

commit 6fcfb4b0e158ba86371ad4d0728490f3a8e7caeb
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2016-07-26T02:21:14Z

    minor cleanups

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    If it is not urgent, I'd like to try some large scale training to understand more about the improvements. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    **[Test build #65798 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65798/consoleFull)** for PR 14359 at commit [`d16c2da`](https://github.com/apache/spark/commit/d16c2da0f53371aea39c85426015ba46d2a0c27e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

Posted by jodersky <gi...@git.apache.org>.
Github user jodersky commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14359#discussion_r72322067
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1110,4 +1124,40 @@ private[spark] object RandomForest extends Logging {
         }
       }
     
    +  /**
    +   * Class for queueing nodes to split on each iteration.
    +   * This is a FILO queue.
    +   */
    +  private[impl] class NodeQueue {
    +    private var q: mutable.MutableList[(Int, LearningNode)] =
    +      mutable.MutableList.empty[(Int, LearningNode)]
    +
    +    /** Put (treeIndex, node) into queue. Constant time. */
    +    def put(treeIndex: Int, node: LearningNode): Unit = {
    +      q += ((treeIndex, node))
    +    }
    +
    +    /** Remove and return last inserted element.  Linear time (unclear in Scala docs though) */
    --- End diff --
    
    How critical is this for performance? I know that `append` is O(1) for scala mutable lists but `dropRight(1)` is actually implemented in a parent class [1] of `MutableList` and therefores does not take advantage of the list's reference to its last element. All in all, from it's implementation it seems that `dropRight` is O(n)
    
    [1] https://github.com/scala/scala/blob/v2.11.8/src/library/scala/collection/LinearSeqOptimized.scala#L194-L204


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14359#discussion_r72197712
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1110,4 +1124,40 @@ private[spark] object RandomForest extends Logging {
         }
       }
     
    +  /**
    +   * Class for queueing nodes to split on each iteration.
    +   * This is a FILO queue.
    +   */
    +  private[impl] class NodeQueue {
    +    private var q: mutable.MutableList[(Int, LearningNode)] =
    +      mutable.MutableList.empty[(Int, LearningNode)]
    +
    +    /** Put (treeIndex, node) into queue. Constant time. */
    +    def put(treeIndex: Int, node: LearningNode): Unit = {
    +      q += ((treeIndex, node))
    +    }
    +
    +    /** Remove and return last inserted element.  Linear time (unclear in Scala docs though) */
    --- End diff --
    
    "a FILO queue" is an unconventional saying. I like it though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    **[Test build #64020 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64020/consoleFull)** for PR 14359 at commit [`133fdbf`](https://github.com/apache/spark/commit/133fdbf9972745007eee4e68ffff703b3c9ad68e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by jodersky <gi...@git.apache.org>.
Github user jodersky commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    > I switched to Stack and then realized Stack has been deprecated in Scala 2.11...
    
    I think you probably read the *immutable* stack docs; the *mutable* stack is not deprecated AFAIK. I can imagine that having a custom stack implementation may allow for additional operations in the future, however we should also consider that using standard collections reduces the load for anyone who will maintain the code then.
    
    Btw, I highly recommend to use the [milestone scaladocs](http://www.scala-lang.org/api/2.12.0-M5/scala/collection/mutable/Stack.html) over the current ones. Although 2.12 is not officially out yet, the changes to the library are minimal and the UI is much more pleasant to use.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    **[Test build #63886 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63886/consoleFull)** for PR 14359 at commit [`41f4297`](https://github.com/apache/spark/commit/41f4297f7602c062c78c76b2215397830ed7b6af).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    LGTM pending tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Sorry for the long delay; I've been swamped by other things for a while.  Re-emerging...
    
    I switched to Stack and then realized Stack has been deprecated in Scala 2.11, so I reverted to the original NodeQueue.  But I renamed NodeQueue to NodeStack to be a bit clearer.
    
    @hhbyyh Any luck testing this at scale?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

Posted by jodersky <gi...@git.apache.org>.
Github user jodersky commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14359#discussion_r75361458
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -161,31 +161,43 @@ private[spark] object RandomForest extends Logging {
           None
         }
     
    -    // FIFO queue of nodes to train: (treeIndex, node)
    -    val nodeQueue = new mutable.Queue[(Int, LearningNode)]()
    +    /*
    +      FILO queue of nodes to train: (treeIndex, node)
    +      We make this FILO by always inserting nodes by appending (+=) and removing with dropRight.
    --- End diff --
    
    the methods described here still refer to the original queue data structure


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    **[Test build #62900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62900/consoleFull)** for PR 14359 at commit [`3c00d03`](https://github.com/apache/spark/commit/3c00d03735ac60743744c10831c8b9d27050f315).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64020/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    **[Test build #63822 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63822/consoleFull)** for PR 14359 at commit [`f79f77c`](https://github.com/apache/spark/commit/f79f77ce49aa797e8432b56fd2ad115540be67cf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    **[Test build #63886 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63886/consoleFull)** for PR 14359 at commit [`41f4297`](https://github.com/apache/spark/commit/41f4297f7602c062c78c76b2215397830ed7b6af).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Hi Joseph, Sorry for the late response. I was occupied by a customer Spark project for the past month.
    
    The idea looks reasonable and I tested with MNist dataset and the overall run time decrease from 245 seconds to 225 seconds on average. LGTM.
    
     


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    **[Test build #62858 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62858/consoleFull)** for PR 14359 at commit [`6fcfb4b`](https://github.com/apache/spark/commit/6fcfb4b0e158ba86371ad4d0728490f3a8e7caeb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    This is a really nice improvement. The communication overhead is reduced, based on some simple local tests. I wonder how we can add a test to verify that the algorithm focuses on completing whole trees at once. Potentially, we can add a test of `selectNodesToSplit` to verify that it chooses nodes from fewer number of trees, but I'm not sure it's necessary. Thoughts?
    
    Also, it might not be too hard to take this a step further. We could group the nodes to be trained by tree, and keep track of the amount of memory they require. Then to select nodes to split, we can simply pick off the trees that require the most memory until we exceed the threshold. This way we truly minimize the number of trees while still occupying the memory size. We could leave it for another JIRA.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    **[Test build #65798 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65798/consoleFull)** for PR 14359 at commit [`d16c2da`](https://github.com/apache/spark/commit/d16c2da0f53371aea39c85426015ba46d2a0c27e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62900/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    **[Test build #62900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62900/consoleFull)** for PR 14359 at commit [`3c00d03`](https://github.com/apache/spark/commit/3c00d03735ac60743744c10831c8b9d27050f315).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62858/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    @hhbyyh This is an improvement I had implemented a while back, just a little too late for the 2.0 code freeze.  Could you please help review it or find others?  Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by jodersky <gi...@git.apache.org>.
Github user jodersky commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    @jkbradley , you can find the scaladoc on stacks here  http://www.scala-lang.org/api/current/index.html#scala.collection.mutable.Stack
    
    Also this document http://docs.scala-lang.org/overviews/collections/performance-characteristics gives a nice overview of the different collection types in scala and their performances



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Thanks @jodersky !  Updated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Thanks @hhbyyh and @sethah !
    
    I agree that a later PR could be more careful about which trees are completed in which order and test this more thoroughly.  But I hope this takes us 80% of the way there.  If it's Ok with you, I'd like to go ahead and merge it as is once tests pass.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    **[Test build #62858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62858/consoleFull)** for PR 14359 at commit [`6fcfb4b`](https://github.com/apache/spark/commit/6fcfb4b0e158ba86371ad4d0728490f3a8e7caeb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14359#discussion_r72338537
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1110,4 +1124,40 @@ private[spark] object RandomForest extends Logging {
         }
       }
     
    +  /**
    +   * Class for queueing nodes to split on each iteration.
    +   * This is a FILO queue.
    +   */
    +  private[impl] class NodeQueue {
    +    private var q: mutable.MutableList[(Int, LearningNode)] =
    +      mutable.MutableList.empty[(Int, LearningNode)]
    +
    +    /** Put (treeIndex, node) into queue. Constant time. */
    +    def put(treeIndex: Int, node: LearningNode): Unit = {
    +      q += ((treeIndex, node))
    +    }
    +
    +    /** Remove and return last inserted element.  Linear time (unclear in Scala docs though) */
    --- End diff --
    
    Whoops, I missed that List.tail is O(1) time.  I switched to a List.  Thanks all!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65798/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    **[Test build #63822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63822/consoleFull)** for PR 14359 at commit [`f79f77c`](https://github.com/apache/spark/commit/f79f77ce49aa797e8432b56fd2ad115540be67cf).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Ahh, you're right; I was looking at immutable.  I'll update to use the mutable stack.  Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63822/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Btw, to give back-of-the-envelope estimates, we can look at 2 numbers:
    (1) How many nodes will be split on each iteration?
    (2) How big is the forest which is serialized and sent to workers on each iteration?
    
    For (1), here's an example:
    * 1000 features, each with 50 bins -> 50000 possible splits
    * set maxMemoryInMB = 256 (default)
    * regression => 3 Double values per possible split
    * 256 * 10^6 / (3 * 50000 * 8) = 213 nodes/iteration
    
    This implies that for trees of depth > 8 or so, many iterations will only split nodes from 1 or 2 trees.  I.e., we should avoid communicating most trees.
    
    For (2), the forest can be pretty expensive to send.
    * Each node:
      * leaf node: 5 Doubles
      * internal node: ~8 Doubles/references + Split
        * Split: O(# categories) or 2 values for continuous, say 3 Doubles on average
      * => say 8 Doubles/node on average
    * 100 trees of depth 8 => 25600 nodes => 1.6MB
    * 100 trees of depth 14 => 105MB
    * I've heard of many cases of users wanting to fit 500-1000 trees and use trees of depth 18-20.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63886/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by jodersky <gi...@git.apache.org>.
Github user jodersky commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Agree, it's not very obvious. In the latter document I think a `push` is akin to `append` and `pop` to `head`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14359#discussion_r72179527
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1110,4 +1124,40 @@ private[spark] object RandomForest extends Logging {
         }
       }
     
    +  /**
    +   * Class for queueing nodes to split on each iteration.
    +   * This is a FILO queue.
    +   */
    +  private[impl] class NodeQueue {
    +    private var q: mutable.MutableList[(Int, LearningNode)] =
    +      mutable.MutableList.empty[(Int, LearningNode)]
    +
    +    /** Put (treeIndex, node) into queue. Constant time. */
    +    def put(treeIndex: Int, node: LearningNode): Unit = {
    +      q += ((treeIndex, node))
    +    }
    +
    +    /** Remove and return last inserted element.  Linear time (unclear in Scala docs though) */
    --- End diff --
    
    Note: This is not ideal, but it should be insignificant compared to the cost of communicating the trees.  I am not aware of an existing solution for a stack with constant-time push/pop in Scala.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Done!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by jodersky <gi...@git.apache.org>.
Github user jodersky commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Some comments still refer to the use of queue and should be updated. Other than that, the data structure part now looks good to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14359#discussion_r72317443
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1110,4 +1124,40 @@ private[spark] object RandomForest extends Logging {
         }
       }
     
    +  /**
    +   * Class for queueing nodes to split on each iteration.
    +   * This is a FILO queue.
    +   */
    +  private[impl] class NodeQueue {
    +    private var q: mutable.MutableList[(Int, LearningNode)] =
    +      mutable.MutableList.empty[(Int, LearningNode)]
    +
    +    /** Put (treeIndex, node) into queue. Constant time. */
    +    def put(treeIndex: Int, node: LearningNode): Unit = {
    +      q += ((treeIndex, node))
    +    }
    +
    +    /** Remove and return last inserted element.  Linear time (unclear in Scala docs though) */
    --- End diff --
    
    Is there some reason not to use `scala.collection.mutable.Stack[(Int, LearningNode)]` here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Ack. I'll review it and run tests tonight. Is it targeting 2.0?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14359#discussion_r72338956
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1110,4 +1124,40 @@ private[spark] object RandomForest extends Logging {
         }
       }
     
    +  /**
    +   * Class for queueing nodes to split on each iteration.
    +   * This is a FILO queue.
    +   */
    +  private[impl] class NodeQueue {
    +    private var q: mutable.MutableList[(Int, LearningNode)] =
    +      mutable.MutableList.empty[(Int, LearningNode)]
    +
    +    /** Put (treeIndex, node) into queue. Constant time. */
    +    def put(treeIndex: Int, node: LearningNode): Unit = {
    +      q += ((treeIndex, node))
    +    }
    +
    +    /** Remove and return last inserted element.  Linear time (unclear in Scala docs though) */
    --- End diff --
    
    I also somehow didn't find Stack via Google...  Does anyone know of documentation for the performance of Stack?  I don't see it in the Scala docs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Thanks @jodersky   I saw those, but the first does not document computational cost & the latter does not really clarify what I need for stacks (push and pop).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14359


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Merging with master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Not urgent, but I'd like it to be in 2.1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    **[Test build #64020 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64020/consoleFull)** for PR 14359 at commit [`133fdbf`](https://github.com/apache/spark/commit/133fdbf9972745007eee4e68ffff703b3c9ad68e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14359: [SPARK-16719][ML] Random Forests should communica...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14359#discussion_r79679122
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -161,31 +161,42 @@ private[spark] object RandomForest extends Logging {
           None
         }
     
    -    // FIFO queue of nodes to train: (treeIndex, node)
    -    val nodeQueue = new mutable.Queue[(Int, LearningNode)]()
    +    /*
    +      Stack of nodes to train: (treeIndex, node)
    +      The reason this is FILO is that we train many trees at once, but we want to focus on
    --- End diff --
    
    "The reason this is FILO" -> "The reason we use a stack"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org