Posted to issues@spark.apache.org by "Xiangrui Meng (JIRA)" <ji...@apache.org> on 2014/10/18 02:37:33 UTC

[jira] [Updated] (SPARK-3161) Cache example-node map for DecisionTree training

     [ https://issues.apache.org/jira/browse/SPARK-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-3161:
---------------------------------
    Assignee: Sung Chung

> Cache example-node map for DecisionTree training
> ------------------------------------------------
>
>                 Key: SPARK-3161
>                 URL: https://issues.apache.org/jira/browse/SPARK-3161
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Sung Chung
>
> Improvement: worker computation
> When training each level of a DecisionTree, each example needs to be mapped to a node in the current level (or to none if it does not reach that level).  This is currently done via the function predictNodeIndex(), which traces from the current tree’s root node to the given level.
> Proposal: Cache this mapping.
> * Pro: O(1) lookup instead of O(level).
> * Con: Extra RDD which must share the same partitioning as the training data.
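The lookup-cost difference above can be sketched as follows. This is a minimal illustration only; `Node`, `predictNodeIndex`, and `cachedNodeIndex` here are simplified stand-ins, not the actual MLlib internals:

```scala
// Simplified binary-tree node: `split` is Some((featureIndex, threshold))
// for internal nodes and None for leaves.
case class Node(id: Int, split: Option[(Int, Double)],
                left: Node = null, right: Node = null)

// Current approach: trace from the root down to the example's node,
// costing O(level) comparisons per example per iteration.
def predictNodeIndex(node: Node, features: Array[Double]): Int = node.split match {
  case Some((f, t)) if node.left != null =>
    if (features(f) <= t) predictNodeIndex(node.left, features)
    else predictNodeIndex(node.right, features)
  case _ => node.id
}

// Proposed approach: keep a cached example -> node-index mapping alongside
// the training data, so each lookup is O(1). nodeIndexOf(i) holds the id of
// the node that example i currently sits in.
def cachedNodeIndex(nodeIndexOf: Array[Int], example: Int): Int =
  nodeIndexOf(example)
```

The cache would need to be refreshed once per trained level (each example moves to a child of its current node), which is a single O(1)-per-example pass instead of repeated root-to-node traversals.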
> Design:
> * (option 1) This could be done as in Sequoia Forests (https://github.com/AlpineNow/SparkML2), where each instance is stored with an array of node indices (1 node per tree).
> * (option 2) This could also be done by storing an RDD[Array[Map[Int, Array[TreePoint]]]], where each partition stores an array of maps from node indices to arrays of instances.  This has more data-structure overhead but could be more efficient: not all nodes are split on each iteration, and this would allow each executor to ignore instances that are not used for the current node set.
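Option 2's partition layout can be sketched as follows, under the assumptions in the bullet above; `TreePoint` stands in for MLlib's binned instance representation, and `instancesFor` is a hypothetical helper, not an existing API:

```scala
// Stand-in for the binned instance representation used during training.
case class TreePoint(label: Double, binnedFeatures: Array[Int])

// One partition of the proposed RDD[Array[Map[Int, Array[TreePoint]]]]:
// the outer Array is indexed by tree, and each Map sends a node index to
// the instances currently sitting in that node.
def instancesFor(partition: Array[Map[Int, Array[TreePoint]]],
                 treeIdx: Int,
                 nodesToSplit: Set[Int]): Array[TreePoint] =
  // Only instances in nodes chosen for splitting are touched; instances in
  // all other nodes are skipped entirely, which is the efficiency win.
  partition(treeIdx)
    .collect { case (nodeId, pts) if nodesToSplit(nodeId) => pts }
    .flatten
    .toArray
```

When only a subset of frontier nodes is split in an iteration, an executor scans just the map entries for that subset rather than every instance in the partition.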



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org