Posted to issues@spark.apache.org by "Yuewei Na (JIRA)" <ji...@apache.org> on 2016/06/22 20:28:16 UTC

[jira] [Comment Edited] (SPARK-9478) Add class weights to Random Forest

    [ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345105#comment-15345105 ] 

Yuewei Na edited comment on SPARK-9478 at 6/22/16 8:28 PM:
-----------------------------------------------------------

Hi [~sethah], thanks a lot for commenting on my PR and for your continued attention to this problem. Sorry for not commenting here before I opened the PR. As you said, the main reason I opened another PR is the title of this JIRA.

I implemented class weights instead of sticking with instance weights because:
1. Existing implementations in other languages and packages, e.g. rpart in R and sklearn in Python, all support class weights instead of instance weights. Admittedly, instance weights would also make weighting possible in regression, but the main application of weighting for imbalanced datasets is classification. If one does need weighting in regression, it could be done by downsampling or upsampling the whole dataset. In the materials I've read, including the book 'Elements of Statistical Learning', rpart's documentation (https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf), and some professors' slides, I have never seen a use case for handling imbalanced datasets in regression problems with Random Forest. I would be very happy if someone could tell me under what circumstances it is needed.

2. As you commented in the first PR, the instance weight implementation causes trouble for the 'minInstancesPerNode' feature. The class weight implementation has no such issue, which should make the code more stable because very few internal modifications are needed.
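
To make the idea concrete, here is a minimal sketch in plain Python of how class weights can change the impurity of a tree node. It is not the code from the PR and not Spark's API. The function names are hypothetical, and the "balanced" heuristic (n_samples / (n_classes * class_count)) is an assumption borrowed from sklearn's class_weight='balanced' option, not something specified in this JIRA.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Hypothetical helper: weight each class by n / (k * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_gini(labels, class_weights):
    """Gini impurity where each instance contributes its class's weight."""
    totals = Counter()
    for y in labels:
        totals[y] += class_weights[y]
    total = sum(totals.values())
    return 1.0 - sum((w / total) ** 2 for w in totals.values())


# An imbalanced node: 90 instances of class 0, 10 of class 1.
labels = [0] * 90 + [1] * 10

uniform = {0: 1.0, 1: 1.0}
balanced = balanced_class_weights(labels)  # class 1 gets weight 5.0

unweighted_impurity = weighted_gini(labels, uniform)   # about 0.18
balanced_impurity = weighted_gini(labels, balanced)    # about 0.5

print(unweighted_impurity, balanced_impurity)
```

With uniform weights the node looks almost pure, so a tree has little incentive to split off the minority class. With balanced class weights the same node looks maximally impure for two classes. Only the per-class totals are reweighted, and the raw instance count of the node (100 here) is unchanged, which is why a count-based check such as 'minInstancesPerNode' would not need to change under this scheme.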


> Add class weights to Random Forest
> ----------------------------------
>
>                 Key: SPARK-9478
>                 URL: https://issues.apache.org/jira/browse/SPARK-9478
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 1.4.1
>            Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class weights. Class weights are important when there is imbalanced training data or the evaluation metric of a classifier is imbalanced (e.g. true positive rate at some false positive threshold). 


