Posted to issues@spark.apache.org by "Eugene Morozov (JIRA)" <ji...@apache.org> on 2016/03/23 08:29:25 UTC

[jira] [Comment Edited] (SPARK-14043) Remove restriction on maxDepth for decision trees

    [ https://issues.apache.org/jira/browse/SPARK-14043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206841#comment-15206841 ] 

Eugene Morozov edited comment on SPARK-14043 at 3/23/16 7:28 AM:
-----------------------------------------------------------------

I have a couple of ideas to mitigate the issue:
- introduce an Array64 (an int[][] that allows arrays longer than Integer.MAX_VALUE) or a List; the downside is that it would require a lot of memory just to store those indices (see the sketch below),
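A minimal sketch of what such a chunked structure could look like (the class name Array64, the chunk size, and the API are illustrative assumptions, not existing Spark code):

{code:scala}
// Illustrative only: a Long-indexed Int array backed by Array[Array[Int]] chunks,
// so the total length can exceed Int.MaxValue. Not an existing Spark class.
class Array64(numElements: Long) {
  private val ChunkBits = 27                  // 2^27 Ints per chunk
  private val ChunkSize = 1 << ChunkBits
  private val numChunks = ((numElements - 1) >> ChunkBits) + 1

  private val chunks: Array[Array[Int]] =
    Array.tabulate(numChunks.toInt) { c =>
      val remaining = numElements - c.toLong * ChunkSize
      new Array[Int](math.min(remaining, ChunkSize.toLong).toInt)
    }

  def apply(i: Long): Int =
    chunks((i >> ChunkBits).toInt)((i & (ChunkSize - 1)).toInt)

  def update(i: Long, v: Int): Unit =
    chunks((i >> ChunkBits).toInt)((i & (ChunkSize - 1)).toInt) = v
}
{code}

Even with such a structure, a positional layout still reserves one slot per possible node position, which is where the memory concern above comes from.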

It looks like the issue with using an array is that most of the indexes are "wasted": even if some nodes are never split, the array reserves elements as if they were. This greatly reduces the number of nodes that can actually be used in the decision tree. For example, I trained a couple of models with different maxDepth, each with 50 trees. The decision trees in both models looked like the following pairs:
Model with maxDepth = 20:
- depth=20, numNodes=471
- depth=19, numNodes=497

Model with maxDepth = 30:
- depth=30, numNodes=11347
- depth=30, numNodes=10963

Even though the decision trees grow up to the limit of 30 levels, they contain far fewer nodes than the index space allows.
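To make the waste concrete, here is the arithmetic behind the positional numbering (a back-of-the-envelope sketch, not a quote of any Spark method):

{code:scala}
// Index slots reserved by the complete-binary-tree numbering for a given maxDepth,
// versus the node counts actually observed in the trained models above.
def maxIndices(maxDepth: Int): Long = (1L << (maxDepth + 1)) - 1

maxIndices(20)  // = 2,097,151 reserved slots for the ~471-497 real nodes above
maxIndices(30)  // = 2,147,483,647 reserved slots for the ~11,000 real nodes above

// At maxDepth = 30 the largest possible index is already Int.MaxValue, which is
// why the current Int-based numbering cannot go any deeper.
assert(maxIndices(30) == Int.MaxValue.toLong)
{code}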

- Another way to solve this is to represent the decision tree as an actual tree: it would still allow us to use node indexes, but it wouldn't "waste" them. In that case, if the node indexes are integers, the limit becomes 2^31 - 1 actual nodes. I'm not sure whether that's fully feasible, but I'd say it's better to use longs just in case (a sketch follows below).
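A minimal sketch of such an explicit representation, with a Long id kept only for bookkeeping (the names and fields are illustrative assumptions, not Spark's internal node classes):

{code:scala}
// Illustrative only: nodes hold direct child references instead of positions in a
// complete binary tree, so ids are consumed only by nodes that actually exist.
case class TreeNode(
    id: Long,                             // Long instead of Int, "just in case"
    split: Option[(Int, Double)],         // (featureIndex, threshold) for internal nodes
    prediction: Double,
    var left: Option[TreeNode] = None,
    var right: Option[TreeNode] = None) {

  def isLeaf: Boolean = left.isEmpty && right.isEmpty

  def predict(features: Array[Double]): Double = split match {
    case Some((feature, threshold)) if features(feature) <= threshold =>
      left.map(_.predict(features)).getOrElse(prediction)
    case Some(_) =>
      right.map(_.predict(features)).getOrElse(prediction)
    case None =>
      prediction
  }
}
{code}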


was (Author: jean):
I looked at the Spark code regarding the issue and I have a couple of ideas about how this could be fixed:
- introduce an Array64 (an int[][] that allows arrays longer than Integer.MAX_VALUE) or a List; the downside is that it would require a lot of memory just to store those indices,
- represent the decision tree as a tree without nodeIds at all.

> Remove restriction on maxDepth for decision trees
> -------------------------------------------------
>
>                 Key: SPARK-14043
>                 URL: https://issues.apache.org/jira/browse/SPARK-14043
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> We currently restrict decision trees (DecisionTree, GBT, RandomForest) to be of maxDepth <= 30.  We should remove this restriction to support deep (imbalanced) trees.
> Trees store an index for each node, where each index corresponds to a unique position in a binary tree.  (I.e., the first index of row 0 is 1, the first of row 1 is 2, the first of row 2 is 4, etc., IIRC)
> With some careful thought, we could probably avoid using indices altogether.
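For reference, the numbering the description refers to is the standard complete-binary-tree arithmetic (sketched below for clarity; these helpers are illustrative, not Spark's internal methods):

{code:scala}
// Node 1 is the root; the children of node i sit at 2*i and 2*i + 1,
// so row d starts at index 2^d, matching the description above.
def leftChildIndex(i: Int): Int = 2 * i
def rightChildIndex(i: Int): Int = 2 * i + 1
def firstIndexOfRow(depth: Int): Int = 1 << depth

assert(firstIndexOfRow(0) == 1 && firstIndexOfRow(1) == 2 && firstIndexOfRow(2) == 4)
{code}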



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org