Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2017/03/31 21:46:41 UTC

[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest

    [ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951685#comment-15951685 ] 

Joseph K. Bradley commented on SPARK-9478:
------------------------------------------

[~clamus] The current vote is to *not use* weights during sampling and then to *use* weights when growing the trees.  That will simplify the sampling process so we hopefully won't have to deal with the complexity you're mentioning.  Note that we'll have to weight the trees in the forest to make this approach work.
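To make "use weights when growing the trees" concrete, here is a minimal sketch (in Python, not Spark's actual Scala implementation) of a weighted Gini impurity, where each instance contributes its weight rather than a unit count when evaluating splits. The labels and weights below are illustrative, matching the 4-instance example discussed in this thread:

```python
# Hypothetical sketch, not Spark's code: weighted Gini impurity,
# illustrating how sample weights can influence split decisions
# when growing a tree.
from collections import defaultdict

def weighted_gini(labels, weights):
    """Gini impurity where each instance contributes its weight
    instead of a unit count."""
    total = sum(weights)
    if total == 0.0:
        return 0.0
    by_label = defaultdict(float)
    for y, w in zip(labels, weights):
        by_label[y] += w
    return 1.0 - sum((w / total) ** 2 for w in by_label.values())

# Illustrative 4-instance dataset: three weight-1 rows of label 1
# and one weight-1000 row of label 0.
labels = [1, 1, 1, 0]
weights = [1.0, 1.0, 1.0, 1000.0]
print(weighted_gini(labels, weights))  # near 0: the heavy row dominates the node
```

With weights, a node holding all four rows is already nearly pure (impurity ≈ 0.006), whereas unweighted Gini on the same labels would be 1 − (3/4)² − (1/4)² = 0.375.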

I'm also guessing that it will give better calibrated probability estimates in the final forest, though this is based on intuition rather than analysis.  E.g., given the 4-instance dataset in [~sethah]'s example above, I'd imagine:
* If we use weights during sampling but not when growing trees...
** Say we want 10 trees.  We pick 10 sets of 4 rows.  The probability of always picking the weight-1000 row is ~0.89.
** So our forest will probably give us 0/1 (poorly calibrated) probabilities.
* If we do not use weights during sampling but use them when growing trees... (current proposal)
** Say we want 10 trees.
** The probability of always picking the weight-1 rows is ~1e-5.  This means we'll almost surely have at least one tree containing the weight-1000 row, so it will dominate our predictions (giving good accuracy).
** The probability of having at least 1 tree with only weight-1 rows is ~0.98 (per tree it's (3/4)^4 ~= 0.32, so across 10 trees it's 1 - (1 - 0.32)^10).  This means it's pretty likely we'll have some tree predicting label1, so we'll keep our probability predictions away from 0 and 1.
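The back-of-the-envelope numbers above can be checked directly. This assumes each tree takes a bootstrap sample of 4 rows (with replacement) from the 4-instance dataset, with 10 trees total; the 0.89 figure uses weighted sampling and the others use unweighted sampling:

```python
# Sanity-checking the probabilities in the two scenarios above.
n_trees, rows_per_tree = 10, 4
total_draws = n_trees * rows_per_tree  # 40 draws overall

# Weighted sampling: the weight-1000 row is drawn with prob 1000/1003.
p_heavy = 1000 / 1003
print(p_heavy ** total_draws)  # ~0.89: every one of the 40 draws is the heavy row

# Unweighted sampling: a weight-1 row is drawn with prob 3/4 per draw.
p_light = 3 / 4
print(p_light ** total_draws)  # ~1e-5: no draw ever picks the heavy row

# Probability that at least one of the 10 trees sees only weight-1 rows.
p_tree_all_light = p_light ** rows_per_tree       # ~0.32 per tree
print(1 - (1 - p_tree_all_light) ** n_trees)      # ~0.98
```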

This is really hand-wavy, but it does alleviate my fears of having extreme log losses.  On the other hand, maybe that risk could also be handled by adding smoothing to the predictions...
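For illustration, the smoothing idea could look like add-alpha (Laplace) smoothing over the forest's per-class votes; this is a hypothetical sketch of the general technique, not a concrete proposal for Spark's API:

```python
# Hypothetical sketch: Laplace-style smoothing of class-probability
# estimates, keeping predictions away from exactly 0 and 1.
def smooth_probs(class_counts, alpha=1.0):
    """Add-alpha smoothing over per-class vote counts (or total weights)."""
    total = sum(class_counts) + alpha * len(class_counts)
    return [(c + alpha) / total for c in class_counts]

# If all 10 trees vote for label 0, the raw estimate would be [1.0, 0.0];
# smoothing pulls it slightly toward uniform: [11/12, 1/12].
print(smooth_probs([10, 0]))
```

This would cap the log loss on any single prediction regardless of how skewed the bootstrap samples turn out to be.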

> Add sample weights to Random Forest
> -----------------------------------
>
>                 Key: SPARK-9478
>                 URL: https://issues.apache.org/jira/browse/SPARK-9478
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.4.1
>            Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support sample (instance) weights. Weights are important when there is imbalanced training data or the evaluation metric of a classifier is imbalanced (e.g. true positive rate at some false positive threshold).  Sample weights generalize class weights, so this could be used to add class weights later on.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org