Posted to issues@spark.apache.org by "Sung Chung (JIRA)" <ji...@apache.org> on 2014/10/14 01:49:33 UTC

[jira] [Commented] (SPARK-3161) Cache example-node map for DecisionTree training

    [ https://issues.apache.org/jira/browse/SPARK-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170254#comment-14170254 ] 

Sung Chung commented on SPARK-3161:
-----------------------------------

(option 2) sounds intriguing, but is it easy to do with RDDs, particularly with fault tolerance? It sounds like the entire structure must fit in memory (at the start, Map[0, Array[TreePoint]] would contain every single point), so it wouldn't work if you don't have enough memory to fit the whole partition. And as the lineage builds up, a lot of memory allocation and garbage collection would be required, since new Array[TreePoint] instances would have to be allocated and deallocated at each level.

(option 1) is straightforward, but yes, there are wasted lookups for irrelevant data points.
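
For illustration, option 1's per-instance structure can be sketched in plain Scala (no Spark dependency). All names here are hypothetical, not Spark's actual API; the sketch assumes the usual 1-based heap layout where node i has children 2*i (left) and 2*i+1 (right).

```scala
// A minimal sketch of option 1: each instance carries an array of current
// node indices, one per tree, giving O(1) lookup instead of the O(level)
// root-to-node trace done by predictNodeIndex(). Names are illustrative.

case class TreePoint(label: Double, binnedFeatures: Array[Int])

// One cached entry per instance: the point plus its node index in each tree.
final case class ExampleNodeMap(point: TreePoint, nodeIndices: Array[Int])

object ExampleNodeMap {
  // O(1): just read the cached index for tree `t`.
  def nodeFor(e: ExampleNodeMap, t: Int): Int = e.nodeIndices(t)

  // After a level is trained, push every instance down one level.
  // `goesLeft(t, node, point)` encodes the chosen split's decision.
  def advance(e: ExampleNodeMap,
              goesLeft: (Int, Int, TreePoint) => Boolean): Unit = {
    var t = 0
    while (t < e.nodeIndices.length) {
      val node = e.nodeIndices(t)
      e.nodeIndices(t) =
        if (goesLeft(t, node, e.point)) 2 * node else 2 * node + 1
      t += 1
    }
  }
}
```

Note that `advance` touches every instance at every level even if its node is not being split this iteration, which is the wasted-lookup cost mentioned above.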

> Cache example-node map for DecisionTree training
> ------------------------------------------------
>
>                 Key: SPARK-3161
>                 URL: https://issues.apache.org/jira/browse/SPARK-3161
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>
> Improvement: worker computation
> When training each level of a DecisionTree, each example needs to be mapped to a node in the current level (or to none if it does not reach that level).  This is currently done via the function predictNodeIndex(), which traces from the current tree’s root node to the given level.
> Proposal: Cache this mapping.
> * Pro: O(1) lookup instead of O(level).
> * Con: Extra RDD which must share the same partitioning as the training data.
> Design:
> * (option 1) This could be done as in [Sequoia Forests | https://github.com/AlpineNow/SparkML2] where each instance is stored with an array of node indices (1 node per tree).
> * (option 2) This could also be done by storing an RDD[Array[Map[Int, Array[TreePoint]]]], where each partition stores an array of maps (one per tree) from node indices to arrays of instances.  This has more overhead in data structures but could be more efficient: not all nodes are split on each iteration, and this would allow each executor to ignore instances which are not used for the current node set.
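
Option 2's per-partition structure (for a single tree) can be sketched with plain Scala collections standing in for an RDD partition; the names and the `goesLeft` predicate below are hypothetical, and the same 1-based heap node numbering is assumed.

```scala
// A minimal sketch of option 2: a map from node index to the instances that
// reached that node. Splitting a level regroups only the active nodes, so
// instances under inactive nodes are skipped entirely -- but each level
// allocates fresh arrays, which is the GC concern raised in the comment.

case class TreePoint(label: Double, binnedFeatures: Array[Int])

// Initially every point lives under the root (node 1).
def init(points: Array[TreePoint]): Map[Int, Array[TreePoint]] =
  Map(1 -> points)

// Split only the nodes in `activeNodes`; other entries carry over untouched.
// `goesLeft(node, point)` is the trained split's decision for that node.
def splitLevel(m: Map[Int, Array[TreePoint]],
               activeNodes: Set[Int],
               goesLeft: (Int, TreePoint) => Boolean): Map[Int, Array[TreePoint]] =
  m.flatMap { case (node, pts) =>
    if (!activeNodes.contains(node)) Seq(node -> pts)
    else {
      val (l, r) = pts.partition(p => goesLeft(node, p))
      Seq(2 * node -> l, 2 * node + 1 -> r).filter(_._2.nonEmpty)
    }
  }
```

In the full design each partition would hold an Array of such maps, one per tree, and the whole structure would live in an RDD alongside the training data.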



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
