Posted to issues@spark.apache.org by "Yakov Kerzhner (Jira)" <ji...@apache.org> on 2021/02/23 14:55:00 UTC

[jira] [Comment Edited] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered

    [ https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289113#comment-17289113 ] 

Yakov Kerzhner edited comment on SPARK-34448 at 2/23/21, 2:54 PM:
------------------------------------------------------------------

As I said in the description, I do not believe that the starting point should cause this bug; the minimizer should still drift to the proper minimum.  The fact that the log(odds) was made the starting point suggests that whoever wrote the code believed the intercept should be close to the log(odds), which is only true if the data is centered.  If I had to guess, I would guess that there is something in the objective function that pulls the intercept towards the log(odds).  That would be a bug, as the log(odds) is a good approximation to the intercept if and only if the data is centered; for non-centered data, it is simply wrong for the intercept to equal (or be close to) the log(odds).  My test shows precisely this: when the data is not centered, Spark still returns an intercept equal to the log(odds) (test 2.b, Intercept: -3.5428941035683303, log(odds): -3.542495168380248, correct intercept: -4).  Indeed, even for centered data (test 1.b), it returns an intercept almost equal to the log(odds) (log(odds): -3.9876303002978997, Intercept: -3.987260922443554, correct intercept: -4).  So we need to dig into the objective function and check whether there is a term in it that penalizes the intercept for moving away from the log(odds).  If there is nothing of that sort, then stepping through the minimization should shed some light on why the intercept isn't budging from its initial value.
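The centered vs. non-centered behavior is easy to reproduce outside Spark.  Here is a small scikit-learn sketch (not Spark's implementation; the true coefficients, the shift of 5.0, and the sample size are illustrative assumptions).  Refitting the same labels against a shifted copy of the feature must give the same slope and an intercept shifted by -b1*c, so for non-centered data the correct intercept lands far from the log(odds):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(0.0, 1.0, n)                       # centered feature
p = 1.0 / (1.0 + np.exp(-(-1.0 + 0.5 * x)))       # true b0 = -1, b1 = 0.5
y = (rng.uniform(size=n) < p).astype(int)

log_odds = np.log(y.mean() / (1.0 - y.mean()))

# Effectively unregularized fits on the centered and a shifted copy of x
clf_c = LogisticRegression(C=1e12, max_iter=2000).fit(x.reshape(-1, 1), y)
clf_s = LogisticRegression(C=1e12, max_iter=2000).fit((x + 5.0).reshape(-1, 1), y)

b0_c, b1_c = clf_c.intercept_[0], clf_c.coef_[0, 0]
b0_s, b1_s = clf_s.intercept_[0], clf_s.coef_[0, 0]
print(f"log(odds):          {log_odds:.4f}")
print(f"centered intercept: {b0_c:.4f}")   # near log(odds) for this data
print(f"shifted intercept:  {b0_s:.4f}")   # ~ b0_c - 5*b1_c, far from log(odds)
```

The slope is unchanged by the shift, and the correct intercept for the shifted data is b0_c - 5*b1_c, not the log(odds) -- which is exactly the value the Spark fit fails to move away from in test 2.b.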



> Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34448
>                 URL: https://issues.apache.org/jira/browse/SPARK-34448
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.4.5, 3.0.0
>            Reporter: Yakov Kerzhner
>            Priority: Major
>              Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the bug, as well as the output of the code and some commentary:
> [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
> To summarize: under certain conditions, the minimization that fits a binary logistic regression contains a bug that pulls the intercept value towards the log(odds) of the target data.  This is mathematically only correct when the data comes from distributions with zero means.  In general, this gives incorrect intercept values, and consequently incorrect coefficients as well.
> As I am not so familiar with the spark code base, I have not been able to find this bug within the spark code itself.  A hint to this bug is here: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
> Based on the code, I don't believe that the features have zero means at this point, so this heuristic is incorrect.  But an incorrect starting point does not explain the bug: the minimizer should still drift to the correct place.  I was not able to find the code of the actual objective function being minimized.
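The claim that a bad starting point alone cannot explain the bug can be checked directly: the binary logistic negative log-likelihood is invariant under shifting the feature by c and moving the intercept to b0 - b1*c, so a correct minimizer must reach the shifted intercept on non-centered data regardless of where it starts.  A minimal numpy sketch of that invariance (this is not Spark's objective code; the data and parameters are illustrative):

```python
import numpy as np

def nll(b0, b1, x, y):
    """Negative log-likelihood of the model P(y=1) = sigmoid(b0 + b1*x)."""
    z = b0 + b1 * x
    # per-sample loss is log(1 + exp(z)) - y*z, computed stably via logaddexp
    return float(np.sum(np.logaddexp(0.0, z) - y * z))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 1000)                              # centered feature
y = (rng.uniform(size=1000) < 1.0 / (1.0 + np.exp(4.0 - x))).astype(int)

c = 3.0  # shift the feature so the data is no longer centered
# (b0, b1) on x and (b0 - b1*c, b1) on x + c give identical objective values,
# so the loss surface over the shifted data has its minimum at the shifted
# intercept -- not at the log(odds), which does not change under the shift.
print(nll(-4.0, 1.0, x, y))
print(nll(-4.0 - 1.0 * c, 1.0, x + c, y))
```

Since the two objective values agree for every (b0, b1), any extra pull of the fitted intercept towards the log(odds) has to come from a term outside this likelihood.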



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org