Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2014/09/11 18:28:33 UTC
[jira] [Updated] (SPARK-3158) Avoid 1 extra aggregation for DecisionTree training
[ https://issues.apache.org/jira/browse/SPARK-3158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley updated SPARK-3158:
-------------------------------------
Description:
Improvement: computation
Currently, the implementation does one unnecessary aggregation step. The aggregation step for level L (to choose splits) gives enough information to set the predictions of any leaf nodes at level L+1. We can use that info and skip the aggregation step for the last level of the tree (which only has leaf nodes).
This update could be done by:
* allocating a root node before the loop in the main train() method
* allocating nodes for level L+1 while choosing splits for level L
* caching stats in these newly allocated nodes, so that we can calculate predictions if we know they will be leaves
* DecisionTree.findBestSplits can just return doneTraining
This will let us cache impurity and avoid re-calculating it in calculateGainForSplit.
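The steps above can be sketched as follows. This is a hypothetical illustration, not the actual MLlib implementation: the names LabelCounts, NodeStats, and childStats are made up, and binary classification with Gini impurity is assumed. It shows why no extra aggregation pass is needed: the left/right label counts already aggregated while choosing a level-L split fully determine each level-(L+1) child's prediction (majority class) and impurity, so both can be cached in the newly allocated child nodes and reused instead of being recomputed in calculateGainForSplit.

```scala
// Hypothetical sketch; names are illustrative, not MLlib internals.
object LastLevelSkipSketch {
  // Per-class label counts on one side of a candidate split
  // (these are exactly the stats the level-L aggregation produces).
  final case class LabelCounts(counts: Array[Double]) {
    def total: Double = counts.sum
    // Prediction if this child becomes a leaf: the majority class.
    def prediction: Int = counts.indices.maxBy(counts(_))
    // Gini impurity, computed once from the cached counts.
    def gini: Double = {
      val n = total
      if (n == 0.0) 0.0
      else 1.0 - counts.map(c => (c / n) * (c / n)).sum
    }
  }

  // Stats cached in a newly allocated level-(L+1) node while choosing
  // the level-L split, so the last level needs no aggregation of its own.
  final case class NodeStats(prediction: Int, impurity: Double)

  def childStats(side: LabelCounts): NodeStats =
    NodeStats(side.prediction, side.gini)

  def main(args: Array[String]): Unit = {
    // Aggregates for the best split at level L: the left child receives
    // 8 examples of class 0 and 2 of class 1; the right child 3 and 7.
    val left  = LabelCounts(Array(8.0, 2.0))
    val right = LabelCounts(Array(3.0, 7.0))
    val l = childStats(left)
    val r = childStats(right)
    println(s"left:  prediction=${l.prediction}, impurity=${l.impurity}")
    println(s"right: prediction=${r.prediction}, impurity=${r.impurity}")
  }
}
```

In this toy setup the left child predicts class 0 with Gini impurity 0.32 and the right child predicts class 1 with impurity 0.42, both derived from counts the split search already had in hand.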
Some of the notes above were copied from the discussion in [https://github.com/apache/spark/pull/2341]
was:
Improvement: computation
Currently, the implementation does one unnecessary aggregation step. The aggregation step for level L (to choose splits) gives enough information to set the predictions of any leaf nodes at level L+1. We can use that info and skip the aggregation step for the last level of the tree (which only has leaf nodes).
> Avoid 1 extra aggregation for DecisionTree training
> ---------------------------------------------------
>
> Key: SPARK-3158
> URL: https://issues.apache.org/jira/browse/SPARK-3158
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Joseph K. Bradley
> Priority: Minor
>