Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2014/09/18 00:12:33 UTC

[jira] [Updated] (SPARK-3161) Cache example-node map for DecisionTree training

     [ https://issues.apache.org/jira/browse/SPARK-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-3161:
-------------------------------------
    Description: 
Improvement: worker computation

When training each level of a DecisionTree, each example needs to be mapped to a node at the current level (or to none, if the example does not reach that level).  This is currently done via the function predictNodeIndex(), which traces a path from the current tree’s root node down to the given level.
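To make the per-example cost concrete, here is a minimal sketch of this kind of root-to-level lookup, using a hypothetical, simplified Node type (the real predictNodeIndex in MLlib also handles categorical splits and binned features):

{code:scala}
// Hypothetical, simplified node; MLlib's real Node class carries more state.
case class Node(
    id: Int,
    split: Option[(Int, Double)],  // (featureIndex, threshold); None at a leaf
    left: Option[Node] = None,
    right: Option[Node] = None)

// Trace from the root toward the target level: O(level) work per example,
// repeated for every example at every level of training.
@annotation.tailrec
def predictNodeIndex(node: Node, features: Array[Double], levelsLeft: Int): Int =
  (node.split, node.left, node.right) match {
    case (Some((f, t)), Some(l), Some(r)) if levelsLeft > 0 =>
      predictNodeIndex(if (features(f) <= t) l else r, features, levelsLeft - 1)
    case _ => node.id  // reached the target level, or stopped at a leaf above it
  }
{code}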

Proposal: Cache this mapping.
* Pro: O(1) lookup instead of O(level).
* Con: Extra RDD which must share the same partitioning as the training data.

Design:
* (option 1) This could be done as in [Sequoia Forests | https://github.com/AlpineNow/SparkML2], where each instance is stored with an array of node indices (one node index per tree); see the sketch after this list.
* (option 2) This could also be done by storing an RDD\[Array\[Map\[Int, Array\[TreePoint\]\]\]\], where each partition stores an array of maps from node indices to arrays of instances.  This adds data-structure overhead but could be more efficient, since not all nodes are split on each iteration.
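A minimal sketch of the bookkeeping for option 1, assuming hypothetical stand-ins for TreePoint and Split, and heap-style node numbering (children of node n are 2n and 2n + 1):

{code:scala}
// Hypothetical stand-ins; MLlib's real TreePoint and Split classes differ.
case class TreePoint(binnedFeatures: Array[Int])
case class Split(feature: Int, bin: Int) {
  def goesLeft(p: TreePoint): Boolean = p.binnedFeatures(feature) <= bin
}

// 1-based heap-style ids: the children of node n are 2n and 2n + 1.
def leftChild(id: Int): Int  = 2 * id
def rightChild(id: Int): Int = 2 * id + 1

// Advance one example by one level, for every tree in the ensemble:
// O(1) per tree, versus O(level) for re-tracing from the root.
// nodeIndices holds this example's current node id, one entry per tree;
// chosenSplits(t) maps node id -> split chosen for that node this iteration.
def advance(point: TreePoint,
            nodeIndices: Array[Int],
            chosenSplits: Array[Map[Int, Split]]): Array[Int] =
  nodeIndices.zipWithIndex.map { case (id, tree) =>
    chosenSplits(tree).get(id) match {
      case Some(s) => if (s.goesLeft(point)) leftChild(id) else rightChild(id)
      case None    => id  // this node was not split on this iteration
    }
  }
{code}

Here the cached indices would live in an extra RDD (e.g. RDD\[(TreePoint, Array\[Int\])\]) co-partitioned with the training data, which is the con noted above.  Option 2 instead groups each partition's instances by node, so an iteration that splits only a few nodes touches only those entries.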

  was:
Improvement: worker computation

When training each level of a DecisionTree, each example needs to be mapped to a node at the current level (or to none, if the example does not reach that level).  This is currently done via the function predictNodeIndex(), which traces a path from the current tree’s root node down to the given level.

Proposal: Cache this mapping.
* Pro: O(1) lookup instead of O(level).
* Con: Extra RDD which must share the same partitioning as the training data.



> Cache example-node map for DecisionTree training
> ------------------------------------------------
>
>                 Key: SPARK-3161
>                 URL: https://issues.apache.org/jira/browse/SPARK-3161
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>


