Posted to issues@spark.apache.org by "xujiajin (Jira)" <ji...@apache.org> on 2020/05/24 15:01:00 UTC

[jira] [Commented] (SPARK-3159) Check for reducible DecisionTree

    [ https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17115343#comment-17115343 ] 

xujiajin commented on SPARK-3159:
---------------------------------

Is it possible to control the prune parameter when training a decision tree model?

I need to use the probability values of the decision tree. According to the source code, the prune parameter controls whether sibling leaf nodes with the same prediction are merged. Although the prune parameter does not affect the prediction, it does affect the probability. Its default value is true and cannot be changed. Below is the (decompiled) source code:
{code:java}
public Node toNode(boolean prune) {
    Object var10000;
    if (this.leftChild().isEmpty() && this.rightChild().isEmpty()) {
        var10000 = this.stats().valid() ? new LeafNode(this.stats().impurityCalculator().predict(), this.stats().impurity(), this.stats().impurityCalculator()) : new LeafNode(this.stats().impurityCalculator().predict(), -1.0D, this.stats().impurityCalculator());
    } else {
        Object var7;
        label50: {
            scala.Predef$.MODULE$.assert(this.leftChild().nonEmpty() && this.rightChild().nonEmpty() && this.split().nonEmpty() && this.stats() != null, new scala.runtime.AbstractFunction0<String>() {
                public static final long serialVersionUID = 0L;

                public final String apply() {
                    return "Unknown error during Decision Tree learning.  Could not convert LearningNode to Node.";
                }
            });
            Tuple2 var2 = new Tuple2(((LearningNode)this.leftChild().get()).toNode(prune), ((LearningNode)this.rightChild().get()).toNode(prune));
            if (var2 != null) {
                Node l = (Node)var2._1();
                Node r = (Node)var2._2();
                if (l instanceof LeafNode) {
                    LeafNode var5 = (LeafNode)l;
                    if (r instanceof LeafNode) {
                        LeafNode var6 = (LeafNode)r;
                        // With prune == true, sibling leaves with identical
                        // predictions collapse into one leaf built from the
                        // parent's impurity statistics.
                        if (prune && var5.prediction() == var6.prediction()) {
                            var7 = new LeafNode(var5.prediction(), this.stats().impurity(), this.stats().impurityCalculator());
                            break label50;
                        }
                    }
                }
            }

            if (var2 == null) {
                throw new MatchError(var2);
            }

            Node l = (Node)var2._1();
            Node r = (Node)var2._2();
            var7 = new InternalNode(this.stats().impurityCalculator().predict(), this.stats().impurity(), this.stats().gain(), l, r, (Split)this.split().get(), this.stats().impurityCalculator());
        }

        var10000 = var7;
    }

    return (Node)var10000;
}
{code}
The following is an example of the effect of the prune parameter on probability. The graph below shows the tree structure when minInstancesPerNode is 29. When minInstancesPerNode is 30, the "feature2 <= 6.15" node is deleted, because all of the children under that node produce the same prediction. The prediction is unchanged, but this considerably changes the reported probabilities.

!image-2020-05-24-23-00-38-419.png!
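The probability shift can be seen with a minimal, self-contained sketch. The Leaf class and maybeMerge helper below are hypothetical simplifications, not Spark's actual API; they only mirror the pruning branch of toNode, where two sibling leaves with the same prediction are replaced by one leaf built from the pooled parent statistics:

```java
public class PruneSketch {
    // Hypothetical leaf: holds per-class instance counts;
    // probability of a class = count / total.
    static final class Leaf {
        final double[] counts;
        Leaf(double... counts) { this.counts = counts; }
        int prediction() {
            int best = 0;
            for (int i = 1; i < counts.length; i++) {
                if (counts[i] > counts[best]) best = i;
            }
            return best;
        }
        double probability(int cls) {
            double total = 0;
            for (double c : counts) total += c;
            return counts[cls] / total;
        }
    }

    // Mirrors the pruning branch of toNode: if both children predict the
    // same class, return a single leaf over the pooled parent counts;
    // otherwise return null (the real code keeps an InternalNode).
    static Leaf maybeMerge(Leaf l, Leaf r, boolean prune) {
        if (prune && l.prediction() == r.prediction()) {
            double[] pooled = new double[l.counts.length];
            for (int i = 0; i < pooled.length; i++) {
                pooled[i] = l.counts[i] + r.counts[i];
            }
            return new Leaf(pooled);
        }
        return null;
    }

    public static void main(String[] args) {
        Leaf left = new Leaf(9, 1);   // predicts class 0 with P = 0.9
        Leaf right = new Leaf(6, 4);  // predicts class 0 with P = 0.6
        Leaf merged = maybeMerge(left, right, true);
        // Prediction stays class 0, but P(class 0) becomes 15/20 = 0.75.
        System.out.println(merged.prediction() + " " + merged.probability(0));
    }
}
```

With prune = false, the two leaves would be kept, and an instance routed to the left leaf would report P = 0.9 instead of the pooled 0.75, which is exactly the difference described above.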

> Check for reducible DecisionTree
> --------------------------------
>
>                 Key: SPARK-3159
>                 URL: https://issues.apache.org/jira/browse/SPARK-3159
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Alessandro Solimando
>            Priority: Minor
>             Fix For: 2.4.0
>
>
> Improvement: test-time computation
> Currently, pairs of leaf nodes with the same parent can both output the same prediction.  This happens since the splitting criterion (e.g., Gini) is not the same as prediction accuracy/MSE; the splitting criterion can sometimes be improved even when both children would still output the same prediction (e.g., based on the majority label for classification).
> We could check the tree and reduce it if possible after training.
> Note: This happens with scikit-learn as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org