Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2014/09/05 20:59:29 UTC

[jira] [Updated] (SPARK-3160) Simplify DecisionTree data structure for training

     [ https://issues.apache.org/jira/browse/SPARK-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-3160:
-------------------------------------
    Description: 
Improvement: code clarity

Currently, we maintain a tree structure, a flat array of nodes, and a parentImpurities array.

Proposed fix: Maintain everything within a growing tree structure.  For this, we could have a “LearningNode extends Node” setup where the LearningNode holds metadata for learning (such as impurities).  The test-time model could be extracted from this training-time model, so that extra information (such as impurities) does not have to be kept after training.

This would let us eliminate the flat array of nodes, thus saving storage when we do not grow a full tree.  It would also potentially make it easier to pass subtrees to worker (compute) nodes for local training.


  was:
Improvement: code clarity

Currently, we maintain a tree structure, a flat array of nodes, and a parentImpurities array.

Proposed fix: Maintain everything within a growing tree structure.  For this, we could have a “LearningNode extends Node” setup where the LearningNode holds metadata for learning (such as impurities).  The test-time model could be extracted from this training-time model, so that extra information (such as impurities) does not have to be kept after training.



> Simplify DecisionTree data structure for training
> -------------------------------------------------
>
>                 Key: SPARK-3160
>                 URL: https://issues.apache.org/jira/browse/SPARK-3160
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Improvement: code clarity
> Currently, we maintain a tree structure, a flat array of nodes, and a parentImpurities array.
> Proposed fix: Maintain everything within a growing tree structure.  For this, we could have a “LearningNode extends Node” setup where the LearningNode holds metadata for learning (such as impurities).  The test-time model could be extracted from this training-time model, so that extra information (such as impurities) does not have to be kept after training.
> This would let us eliminate the flat array of nodes, thus saving storage when we do not grow a full tree.  It would also potentially make it easier to pass subtrees to worker (compute) nodes for local training.
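
For illustration only, the "LearningNode extends Node" idea from the description could look roughly like the sketch below. It is written in Java for brevity and is not taken from the Spark code base: every class, field, and method name here (Node, LearningNode, split, toModel, and so on) is a hypothetical stand-in, and the single-threshold split on one continuous feature is an assumption made to keep the example small.

```java
// Hypothetical sketch; none of these names come from the actual MLlib code.

/** Test-time node: holds only what prediction needs. */
class Node {
    final int id;
    double prediction;
    int splitFeature = -1;   // -1 while the node is a leaf
    double splitThreshold;
    Node left;
    Node right;

    Node(int id, double prediction) {
        this.id = id;
        this.prediction = prediction;
    }

    boolean isLeaf() {
        return left == null;
    }

    /** Walk down the tree until a leaf is reached. */
    double predict(double[] features) {
        Node n = this;
        while (!n.isLeaf()) {
            n = features[n.splitFeature] <= n.splitThreshold ? n.left : n.right;
        }
        return n.prediction;
    }
}

/** Training-time node: adds metadata (here, impurity) that is only needed while learning. */
class LearningNode extends Node {
    final double impurity;

    LearningNode(int id, double prediction, double impurity) {
        super(id, prediction);
        this.impurity = impurity;
    }

    /** Grow the tree in place: this leaf becomes an internal node with two children. */
    void split(int feature, double threshold, LearningNode l, LearningNode r) {
        this.splitFeature = feature;
        this.splitThreshold = threshold;
        this.left = l;
        this.right = r;
    }

    /** Extract the test-time model by copying into plain Nodes, dropping the impurity. */
    Node toModel() {
        Node copy = new Node(id, prediction);
        if (!isLeaf()) {
            copy.splitFeature = splitFeature;
            copy.splitThreshold = splitThreshold;
            // Children of a LearningNode are always LearningNodes in this sketch.
            copy.left = ((LearningNode) left).toModel();
            copy.right = ((LearningNode) right).toModel();
        }
        return copy;
    }
}
```

In this sketch only the nodes that are actually created take up memory, so a tree that stops growing early needs no preallocated flat array, and the per-node impurity replaces a separate parentImpurities array. Because a subtree is just a LearningNode reference, it could in principle be serialized and handed off on its own, though how that would work in Spark is left open here.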



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
