Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:21:21 UTC

[jira] [Updated] (SPARK-17476) Proper handling for unseen labels in logistic regression training.

     [ https://issues.apache.org/jira/browse/SPARK-17476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-17476:
---------------------------------
    Labels: bulk-closed  (was: )

> Proper handling for unseen labels in logistic regression training.
> ------------------------------------------------------------------
>
>                 Key: SPARK-17476
>                 URL: https://issues.apache.org/jira/browse/SPARK-17476
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Seth Hendrickson
>            Priority: Major
>              Labels: bulk-closed
>
> Now that logistic regression supports multiclass classification, it is possible to train on data that nominally has {{K}} classes while one or more of those classes never appear in the training set. For example:
> {code}
> (0.0, x1)
> (2.0, x2)
> ...
> {code}
> Currently, logistic regression assumes that the outcome in the above dataset has three levels: {{0, 1, 2}}. Since label 1 never appears in training, it should never be predicted. In theory, the coefficients for that class should be zero and its intercept should be negative infinity. This can cause problems, since we center the intercepts after training.
> We should discuss whether the intercepts actually tend to -infinity in practice, and whether we should even include them in training.
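
The concern in the description can be illustrated with a small, dependency-free Python sketch. It is not Spark's implementation: the helper names (`softmax`, `center`), the intercept values, and the rule "number of classes = largest label + 1" are assumptions made for illustration, and features are ignored so that only the intercepts matter. It shows that mean-centering leaves predictions unchanged while the intercepts are finite, but yields non-finite values once the unseen class's intercept reaches negative infinity.

```python
import math

def softmax(margins):
    """Softmax with the usual max-shift; a margin of -inf maps to probability 0."""
    top = max(margins)
    exps = [math.exp(z - top) for z in margins]
    total = sum(exps)
    return [e / total for e in exps]

def center(intercepts):
    """Subtract the mean intercept from each intercept (hypothetical helper)."""
    mean = sum(intercepts) / len(intercepts)
    return [b - mean for b in intercepts]

# Labels seen in training, as in the example above: 0.0 and 2.0 only.
labels = [0.0, 2.0]

# Assumption for this sketch: the class count is taken as largest label + 1.
num_classes = int(max(labels)) + 1
unseen = sorted(set(range(num_classes)) - {int(y) for y in labels})

# Finite stand-in: a very negative intercept for the unseen class 1.
# The values are made up for illustration, not fitted by any model.
finite = [0.3, -30.0, -0.1]
probs_before = softmax(finite)
probs_after = softmax(center(finite))  # centering only shifts all margins equally

# Theoretical limit: the intercept of the unseen class is -infinity.
limit = [0.3, -math.inf, -0.1]
probs_limit = softmax(limit)   # class 1 gets probability exactly 0.0
centered_limit = center(limit) # the mean is -inf, so this is [inf, nan, inf]

print("classes:", num_classes, "unseen:", unseen)
print("finite, before centering:", probs_before)
print("finite, after centering: ", probs_after)
print("limit probabilities:     ", probs_limit)
print("limit, centered:         ", centered_limit)
```

With a finite intercept the two probability vectors agree up to rounding, so centering is harmless there. In the limiting case the centered intercepts are no longer usable, which is one way the post-training centering step could misbehave; whether an optimizer actually drives the intercept that far in practice is the open question raised above.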



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)