Posted to issues@spark.apache.org by "Michael Kuhlen (JIRA)" <ji...@apache.org> on 2015/04/13 07:10:13 UTC

[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

    [ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491919#comment-14491919 ] 

Michael Kuhlen commented on SPARK-3727:
---------------------------------------

Hello!

I've implemented predictWithProbabilities() methods for DecisionTreeModel and treeEnsembleModels in Scala. These methods return both the most likely class and the probability of each class. As in scikit-learn, the probabilities are defined as the "mean predicted class probabilities of the trees in the forest\[, where the\] class probability of a single tree is the fraction of samples of the same class in a leaf." ([sklearn.ensemble.RandomForestClassifier.predict_proba|http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba])
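
To make the per-tree definition concrete, here is a minimal plain-Python sketch (not the Scala code in the fork, and not MLlib's API). The function name leaf_class_probabilities and the list-of-labels input are assumptions made for illustration: the probability of a class in a leaf is the fraction of training samples in that leaf carrying that class label.

```python
from collections import Counter


def leaf_class_probabilities(leaf_labels, num_classes):
    """Return the fraction of samples of each class in a single leaf.

    leaf_labels: class indices (0 .. num_classes - 1) of the training
    samples that ended up in this leaf (hypothetical input format).
    """
    total = len(leaf_labels)
    if total == 0:
        raise ValueError("a leaf must contain at least one sample")
    counts = Counter(leaf_labels)
    return [counts.get(c, 0) / total for c in range(num_classes)]


# A leaf holding one sample of class 0 and three samples of class 1.
probs = leaf_class_probabilities([0, 1, 1, 1], num_classes=3)

# The predicted class is the one with the highest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

In this example the leaf yields probabilities of 0.25, 0.75 and 0.0 for the three classes, so the most likely class is 1.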

My approach was to modify the Predict class to hold the probabilities of all classes (rather than only that of the majority class), and to use these probabilities to compute the mean over all trees. I believe this should work for GBTrees as well, since I take care to weight the probabilities by the weight of each tree (1.0 for RandomForest).
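
The weighted averaging described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the implementation in the fork: the function name weighted_mean_probabilities is hypothetical, and each tree is assumed to have already produced a full probability vector for the input.

```python
def weighted_mean_probabilities(tree_probs, tree_weights):
    """Combine per-tree class probabilities into an ensemble prediction.

    tree_probs: one probability vector per tree (hypothetical format).
    tree_weights: one weight per tree (all 1.0 for a random forest).
    Returns (most likely class index, weighted mean probabilities).
    """
    if not tree_probs or len(tree_probs) != len(tree_weights):
        raise ValueError("need one weight per tree and at least one tree")
    total_weight = sum(tree_weights)
    num_classes = len(tree_probs[0])
    mean_probs = [
        sum(w * p[c] for p, w in zip(tree_probs, tree_weights)) / total_weight
        for c in range(num_classes)
    ]
    best_class = max(range(num_classes), key=mean_probs.__getitem__)
    return best_class, mean_probs


# Three equally weighted trees voting on a two-class problem.
per_tree = [[0.5, 0.5], [0.0, 1.0], [0.25, 0.75]]
label, probs = weighted_mean_probabilities(per_tree, [1.0, 1.0, 1.0])
```

With equal weights this reduces to the plain mean over trees; unequal weights shift the result toward the more heavily weighted trees.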

Here's a [link to my fork|https://github.com/apache/spark/compare/master...mqk:master] showing my modifications. I would be happy to issue a pull request for these changes, if that would be of interest to the community. Although I haven't done so yet, I believe it should be straightforward to extend this to also calculate the variance of estimates for regression algorithms, as suggested in this ticket.
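
As a rough illustration of the regression extension mentioned above, the sketch below computes the mean and the variance of the individual trees' predictions for one input. The function name is hypothetical, and the choice of an unweighted population variance is an assumption; the ticket does not specify which variance estimator to use.

```python
from statistics import mean, pvariance


def regression_mean_and_variance(tree_predictions):
    """Return (mean, population variance) of the per-tree predictions.

    tree_predictions: one numeric prediction per tree for a single
    input (hypothetical format; all trees weighted equally).
    """
    m = mean(tree_predictions)
    return m, pvariance(tree_predictions, mu=m)


# Three trees predicting 2.0, 4.0 and 6.0 for the same input.
m, v = regression_mean_and_variance([2.0, 4.0, 6.0])
```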

Best, 

Mike


> DecisionTree, RandomForest: More prediction functionality
> ---------------------------------------------------------
>
>                 Key: SPARK-3727
>                 URL: https://issues.apache.org/jira/browse/SPARK-3727
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>
> DecisionTree and RandomForest currently predict the most likely label for classification and the mean for regression.  Other info about predictions would be useful.
> For classification: estimated probability of each possible label
> For regression: variance of estimate
> RandomForest could also create aggregate predictions in multiple ways:
> * Predict mean or median value for regression.
> * Compute variance of estimates (across all trees) for both classification and regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
