Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/05/19 04:12:59 UTC

[jira] [Commented] (SPARK-7685) Handle high imbalanced data and apply weights to different samples in Logistic Regression

    [ https://issues.apache.org/jira/browse/SPARK-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549667#comment-14549667 ] 

Joseph K. Bradley commented on SPARK-7685:
------------------------------------------

+1
This should probably be done in the Pipelines API, where we can add a "weight: Double" column (HasWeight shared parameter) instead of modifying LabeledPoint.
LogisticRegression is a natural place to start supporting weights, but I hope we can add weight support almost everywhere before too long.
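A minimal sketch of how such a shared weight parameter could look in the Pipelines API. The names HasWeightCol, weightCol, and setWeightCol are hypothetical here; the point is only to illustrate reading a per-example weight from a DataFrame column rather than changing LabeledPoint.

    // Hypothetical shared-param sketch; the names are illustrative only.
    import org.apache.spark.ml.param.{Param, Params}

    trait HasWeightCol extends Params {
      // Name of the DataFrame column holding a per-example weight.
      final val weightCol: Param[String] =
        new Param[String](this, "weightCol", "per-example weight column name")

      final def getWeightCol: String = $(weightCol)
    }

    // Usage sketch: an estimator mixing in HasWeightCol would read this
    // column from the input DataFrame and fold the weight into its loss
    // and gradient, e.g.
    //   new LogisticRegression().setWeightCol("weight").fit(training)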

> Handle high imbalanced data and apply weights to different samples in Logistic Regression
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-7685
>                 URL: https://issues.apache.org/jira/browse/SPARK-7685
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: DB Tsai
>
> In a fraud detection dataset, almost all the samples are negative and only a couple are positive. This kind of highly imbalanced data will bias the model toward the negative class, resulting in poor performance. scikit-learn provides a correction that lets users over-/undersample the samples of each class according to given weights; in "auto" mode, it selects weights inversely proportional to the class frequencies in the training set. This can be done more efficiently by multiplying the weights into the loss and gradient instead of actually over-/undersampling the training dataset, which is very expensive.
> http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
> On the other hand, some training samples may be more important than others; for example, samples from tenured users may matter more than samples from new users. We should be able to provide an additional "weight: Double" field in LabeledPoint so that the learning algorithm can weight samples differently.
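To make the weighting idea above concrete, here is a small self-contained sketch (plain Scala, not Spark code) of multiplying a per-example weight into the logistic loss and gradient. The Example case class and its weight field are illustrative assumptions only; handling class imbalance then reduces to choosing the weights.

    // Self-contained illustration of folding per-example weights into the
    // logistic loss and its gradient (binary labels in {0, 1}).
    object WeightedLogisticLoss {
      case class Example(label: Double, weight: Double, features: Array[Double])

      private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

      private def dot(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (x, y) => x * y }.sum

      // One pass over the data: returns (weighted loss, weighted gradient).
      def lossAndGradient(data: Seq[Example],
                          coefficients: Array[Double]): (Double, Array[Double]) = {
        val gradient = Array.fill(coefficients.length)(0.0)
        var loss = 0.0
        data.foreach { ex =>
          val p = sigmoid(dot(coefficients, ex.features))
          // The weight just scales this example's contribution.
          loss -= ex.weight *
            (ex.label * math.log(p) + (1.0 - ex.label) * math.log(1.0 - p))
          val err = ex.weight * (p - ex.label)
          var i = 0
          while (i < gradient.length) { gradient(i) += err * ex.features(i); i += 1 }
        }
        (loss, gradient)
      }
    }

Setting each example's weight inversely proportional to its class frequency (roughly numSamples / (numClasses * classCount)) then reproduces the "auto" behaviour without replicating or dropping any rows.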


