Posted to issues@spark.apache.org by "Nick Pentreath (JIRA)" <ji...@apache.org> on 2016/04/28 09:18:12 UTC
[jira] [Comment Edited] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

    [ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261655#comment-15261655 ] 

Nick Pentreath edited comment on SPARK-8971 at 4/28/16 7:18 AM:
----------------------------------------------------------------

I think it would be good to have something implemented, so if that means doing it with RDD initially that's fine by me.

For your questions:
1 - I'd still like to see if this same approach could be used for recommendation/ranking style settings, so allowing the user to specify the column would be good.
2 / 3 - I agree it makes most sense to respect trainRatio. The idea is to maintain the class distribution rather than effectively allow different trainRatios between splits. So I vote for exact sampling, as you suggest.
4 - for now no, but I would imagine the main use case for this is for class labels, in which case we can use column metadata (now or in the future) to get the labels?

As for API design, I'm not sure what you mean by "output column" in your first example?

I would go for the `stratifiedCol` approach personally.
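
To make the "respect trainRatio with exact sampling" idea concrete, here is a minimal plain-Python sketch, not Spark code. The function name `stratified_split`, the parameter name `stratified_col`, and the toy data are assumptions made up for illustration; a real implementation would work on an RDD or DataFrame rather than an in-memory list.

```python
import random
from collections import defaultdict


def stratified_split(rows, stratified_col, train_ratio, seed=42):
    """Split rows into (train, validation) so that every stratum
    contributes the same train_ratio share to the training set.

    Hypothetical helper for illustration only; not part of any Spark API.
    """
    rng = random.Random(seed)

    # Group rows by the value of the user-specified stratification column.
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[row[stratified_col]].append(row)

    train, validation = [], []
    for _, members in sorted(by_stratum.items(), key=lambda kv: str(kv[0])):
        rng.shuffle(members)
        # "Exact" sampling: take a fixed count per stratum instead of
        # flipping an independent coin for each row.
        n_train = round(len(members) * train_ratio)
        train.extend(members[:n_train])
        validation.extend(members[n_train:])
    return train, validation


# Toy imbalanced data: 10 positives out of 1000 rows.
rows = [{"id": i, "label": 1 if i < 10 else 0} for i in range(1000)]
train, validation = stratified_split(rows, "label", train_ratio=0.8)
```

With these toy numbers each class is split 80/20 separately, so the positives cannot all land on one side of the split, which is the failure mode the issue describes for plain random sampling.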


was (Author: mlnick):
I think it would be good to have something implemented, so if that means doing it with RDD initially that's fine by me.

For you questions
1 - I'd still like to see if this same approach could be used for recommendation/ranking style settings, so allowing the user to specify the column would be good.
2 / 3 - I agree it makes most sense to respect trainRatio. The idea is to maintain the class distribution rather than allow different trainRatios effectively between strata. So I vote for exact sampling as you suggest
4 - for now no, but I would imagine the main use case for this is for class labels, in which case we can use column metadata (now or in the future) to get the labels?

As for API design, I'm not sure what you mean by "output column" in your first example?

I would go for the `stratifiedCol` approach personally.

> Support balanced class labels when splitting train/cross validation sets
> ------------------------------------------------------------------------
>
>                 Key: SPARK-8971
>                 URL: https://issues.apache.org/jira/browse/SPARK-8971
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Feynman Liang
>            Assignee: Seth Hendrickson
>
> {{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are Spark classes which partition data into training and evaluation sets for performing hyperparameter selection via cross validation.
> Both methods currently perform the split by randomly sampling the datasets. However, when class probabilities are highly imbalanced (e.g. detection of extremely low-frequency events), random sampling may result in cross validation sets not representative of actual out-of-training performance (e.g. no positive training examples could be included).
> Mainstream R packages like [caret|http://topepo.github.io/caret/splitting.html] already support splitting the data based upon the class labels.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
