You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Duc Trung NGUYEN (JIRA)" <ji...@apache.org> on 2015/07/20 11:20:04 UTC

[jira] [Created] (SPARK-9190) CLONE - Support DecisionTree pruning

Duc Trung NGUYEN created SPARK-9190:
---------------------------------------

             Summary: CLONE - Support DecisionTree pruning
                 Key: SPARK-9190
                 URL: https://issues.apache.org/jira/browse/SPARK-9190
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: Duc Trung NGUYEN


Improvement: accuracy, computation

Summary: Pruning is a common method for preventing overfitting with decision trees.  A smart implementation can prune the tree during training in order to avoid training parts of the tree which would be pruned eventually anyways.  DecisionTree does not currently support pruning.

Pruning:  A “pruning” of a tree is a subtree with the same root node, but with zero or more branches removed.
A naive implementation prunes as follows:
(1) Train a depth K tree using a training set.
(2) Compute the optimal prediction at each node (including internal nodes) based on the training set.
(3) Take a held-out validation set, and use the tree to make predictions for each validation example.  This allows one to compute the validation error made at each node in the tree (based on the predictions computed in step (2).)
(4) For each pair of leafs with the same parent, compare the total error on the validation set made by the leafs’ predictions with the error made by the parent’s predictions.  Remove the leafs if the parent has lower error.

A smarter implementation prunes during training, computing the error on the validation set made by each node as it is trained.  Whenever two children increase the validation error, they are pruned, and no more training is required on that branch.

It is common to use about 1/3 of the data for pruning.  Note that pruning is important when using a tree directly for prediction.  It is less important when combining trees via ensemble methods.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org