Posted to issues@spark.apache.org by "Camilo Lamus (JIRA)" <ji...@apache.org> on 2017/02/16 18:50:41 UTC

[jira] [Comment Edited] (SPARK-9478) Add sample weights to Random Forest

    [ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870490#comment-15870490 ] 

Camilo Lamus edited comment on SPARK-9478 at 2/16/17 6:50 PM:
--------------------------------------------------------------

[~sethah] It is very exciting that you guys are working on adding a weighted version of the random forest. I am really looking forward to using it in Spark ML's RF and other algorithms. As [~josephkb] mentioned, adding weights to data points (samples/instances) has a myriad of applications in data analysis.

I have a question about the way you are thinking of using the weights. Are you thinking of using the weights both in the bootstrap sampling step and in growing the trees? Using them in both steps might make the weights overly “important”.

If you are using the weights in the tree-growing process, are you doing something like what is shown in slide 4 here ( http://www.stat.cmu.edu/~ryantibs/datamining/lectures/25-boost.pdf )?
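
To make that concrete, here is a minimal sketch of how sample weights typically enter an impurity computation when growing a tree. This is plain Python/numpy for illustration only; the function and data are my own invention, not Spark code:

{code:python}
import numpy as np

def weighted_gini(labels, weights):
    # Gini impurity where each sample contributes its weight
    # instead of a unit count (illustrative sketch, not Spark code).
    total = weights.sum()
    impurity = 1.0
    for k in np.unique(labels):
        p_k = weights[labels == k].sum() / total  # weighted class proportion
        impurity -= p_k ** 2
    return impurity

# A point with weight 2 behaves like two unweighted copies of itself:
labels = np.array([0, 0, 1])
weights = np.array([1.0, 1.0, 2.0])
print(weighted_gini(labels, weights))  # 0.5, same as unweighted [0, 0, 1, 1]
{code}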

In the case where you would use the weights in constructing the bootstrap samples, as you mention, the marginal distribution of the count with which each data point appears in a bootstrap sample is binomial. However, the joint distribution of the counts is multinomial. Specifically, if you draw N samples with replacement from the original N data points, selecting each with probability p_i = 1/N, the joint distribution is Multinomial(N, p_i = 1/N, i = 1, 2, …, N), and this is not the same as drawing the N counts independently from Binomial(N, 1/N). For one thing, with independent draws you might end up with more or fewer than N samples in total. In regard to the Poisson approximation, I think this might be more problematic, since I believe it requires one of the counts to dominate (i.e., happen with high probability) (see here: http://www.jstor.org/stable/3314676?seq=1#page_scan_tab_contents). This is a theoretical issue, which might not matter in practice. But who knows, it might. And after all, it might just be better to get the counts from Multinomial(N, p_i = w_i / sum(w_j)).
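
To see the difference, here is a quick Python/numpy simulation with made-up weights:

{code:python}
import numpy as np

rng = np.random.default_rng(0)
N = 5
w = np.array([1.0, 1.0, 2.0, 3.0, 3.0])  # made-up sample weights
p = w / w.sum()                          # p_i = w_i / sum(w_j)

# Joint distribution: a single multinomial draw; counts always sum to N.
counts_joint = rng.multinomial(N, p)
print(counts_joint, counts_joint.sum())  # total is exactly N

# Independent marginals: Binomial(N, p_i) drawn separately for each i.
counts_indep = rng.binomial(N, p)
print(counts_indep, counts_indep.sum())  # total can be more or fewer than N
{code}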

Either way, if the Poisson approximation is good enough, it does make more sense to use what you suggest at the end, which is to sample from Poisson(lambda_i = N * w_i / sum(w_j)). Sampling from Poisson(1) and then multiplying by N * w_i / sum(w_j) can worsen the Poisson approximation to the binomial: the multiplied version still has mean lambda_i, but its variance is lambda_i^2 rather than lambda_i, as it would be for a Poisson random variable.
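
The variance mismatch is easy to check numerically (again just an illustrative numpy sketch, with an arbitrary lambda_i):

{code:python}
import numpy as np

rng = np.random.default_rng(0)
lam = 3.0   # stands in for lambda_i = N * w_i / sum(w_j)
n = 1_000_000

direct = rng.poisson(lam, n)        # Poisson(lambda_i)
scaled = lam * rng.poisson(1.0, n)  # lambda_i * Poisson(1)

print(direct.mean(), direct.var())  # mean ~ 3, variance ~ 3 (= lambda_i)
print(scaled.mean(), scaled.var())  # mean ~ 3, variance ~ 9 (= lambda_i^2)
{code}

Both have the right mean, but the scaled version is overdispersed: a data point's count comes in multiples of lambda_i, so it either shows up in a big batch or not at all.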




> Add sample weights to Random Forest
> -----------------------------------
>
>                 Key: SPARK-9478
>                 URL: https://issues.apache.org/jira/browse/SPARK-9478
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.4.1
>            Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class weights. Class weights are important when the training data is imbalanced or when the evaluation metric of a classifier is imbalanced (e.g., true positive rate at some false positive threshold).


