Posted to issues@spark.apache.org by "Manoj Kumar (JIRA)" <ji...@apache.org> on 2016/06/14 05:01:57 UTC

[jira] [Comment Edited] (SPARK-3155) Support DecisionTree pruning

    [ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328939#comment-15328939 ] 

Manoj Kumar edited comment on SPARK-3155 at 6/14/16 5:01 AM:
-------------------------------------------------------------

1. I agree that the use cases are limited to single trees. You lose much of the interpretability if you train the tree to maximum depth; pruning helps restore interpretability while also improving generalization performance.
3. It is intuitive to prune the tree during training (i.e. stop training once the validation error increases). However, this is very similar to just having a stopping criterion such as maximum depth or minimum samples per node (except that the stopping criterion depends on validation data), and it is quite uncommon to do it this way. The standard practice (at least according to my lectures) is to train the tree to full depth and then remove leaves according to validation data.

However, if you feel that #14351 is more important, I can focus on that.



> Support DecisionTree pruning
> ----------------------------
>
>                 Key: SPARK-3155
>                 URL: https://issues.apache.org/jira/browse/SPARK-3155
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision trees.  A smart implementation can prune the tree during training in order to avoid training parts of the tree which would eventually be pruned anyway.  DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for each validation example.  This allows one to compute the validation error made at each node in the tree (based on the predictions computed in step (2)).
> (4) For each pair of leaves with the same parent, compare the total error on the validation set made by the leaves' predictions with the error made by the parent's prediction.  Remove the leaves if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the validation set made by each node as it is trained.  Whenever two children increase the validation error, they are pruned, and no more training is required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is important when using a tree directly for prediction.  It is less important when combining trees via ensemble methods.
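The naive reduced-error pruning procedure quoted above could be sketched roughly as follows. This is a minimal illustration in Python, not Spark MLlib code; the Node class, the `<` / `<=` tie-breaking choice, and all helper names are assumptions for the sake of the example:

```python
# Sketch of naive reduced-error pruning (steps (1)-(4) above).
# Assumes the tree is already trained and each node stores the optimal
# prediction computed from the training set (step (2)).

class Node:
    def __init__(self, prediction, left=None, right=None, split=None):
        self.prediction = prediction        # optimal training-set prediction (step (2))
        self.left, self.right = left, right
        self.split = split                  # (feature_index, threshold) for internal nodes

    def is_leaf(self):
        return self.left is None and self.right is None

def route(node, x):
    """Send a validation example to the child it falls into."""
    feature, threshold = node.split
    return node.left if x[feature] <= threshold else node.right

def node_error(node, examples):
    """Validation error if `node`'s prediction were used for all `examples` (step (3))."""
    return sum(1 for x, y in examples if node.prediction != y)

def prune(node, examples):
    """Step (4): bottom-up, collapse a leaf pair when the parent does no worse.

    Ties are pruned here (<=) to prefer the smaller tree; the description
    above says "lower error", so a strict < is the other defensible choice.
    """
    if node.is_leaf():
        return
    left_ex = [(x, y) for x, y in examples if route(node, x) is node.left]
    right_ex = [(x, y) for x, y in examples if route(node, x) is node.right]
    prune(node.left, left_ex)
    prune(node.right, right_ex)
    if node.left.is_leaf() and node.right.is_leaf():
        child_err = node_error(node.left, left_ex) + node_error(node.right, right_ex)
        if node_error(node, examples) <= child_err:
            node.left = node.right = node.split = None   # collapse to a leaf
```

For example, a root predicting 1 with a left leaf predicting 0 collapses to a single leaf when every held-out example has label 1, since the parent's prediction makes no validation errors while the left child makes one. The "smarter" variant described above would interleave this same comparison with training, abandoning a branch as soon as its children fail to reduce validation error.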



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org