Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/03/10 07:28:29 UTC

[GitHub] [spark] mengxr commented on pull request #31693: [SPARK-34448][ML] Binary logistic regression incorrectly computes the intercept and coefficients with small var features

mengxr commented on pull request #31693:
URL: https://github.com/apache/spark/pull/31693#issuecomment-795017955

This is my understanding of the behavior. Because we didn't center the columns, a near-constant column ends up with very large values after std scaling. As a result, the coefficient for that column is very small and therefore insensitive to regularization, if there is any. The algorithm used to rely on regularization to push the extra weight into the intercept, but that is no longer effective. So the weight can shift freely between the intercept and the feature weight.
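To make the scaling effect concrete, here is a minimal NumPy sketch. It is not Spark code: the feature values, the noise level, and the variable names are all assumptions chosen for illustration. It compares dividing a near-constant column by its standard deviation with and without centering first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical near-constant feature: mean about 1.0, spread about 1e-3.
x = 1.0 + 1e-3 * rng.standard_normal(1000)

sigma = x.std(ddof=1)

# Std scaling without centering: every value is roughly mean / sigma.
scaled_only = x / sigma

# Centering plus scaling: values are of order 1.
standardized = (x - x.mean()) / sigma

print("scaled only, mean magnitude: ", np.abs(scaled_only).mean())
print("standardized, mean magnitude:", np.abs(standardized).mean())
```

Without centering, the scaled column is essentially a large constant plus a small wiggle, so it behaves almost like a second intercept term.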

This PR does the "virtual" centering and we should calculate the initial intercept after centering. Those are good changes.
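As a rough illustration of what "virtual" centering can mean, the sketch below computes the margin on centered and scaled features without ever materializing the centered vector, by folding the mean into a single offset term. The function names and the numbers are hypothetical, and this is not the PR's implementation. It only sketches the centering step, not the initial intercept calculation.

```python
import numpy as np

def margin_explicit(x, w, b, mu, sigma):
    # Centers and scales the feature vector explicitly (densifies sparse input).
    return np.dot(w, (x - mu) / sigma) + b

def margin_virtual(x, w, b, mu, sigma):
    # Same margin, but the centering is folded into one scalar offset,
    # so x itself is only scaled and zeros in x stay irrelevant to the dot product.
    scaled_w = w / sigma
    offset = np.dot(scaled_w, mu)  # depends only on w, mu, sigma; not on x
    return np.dot(scaled_w, x) - offset + b

# Hypothetical example values; the second feature is near-constant.
x = np.array([0.0, 1.001, 0.0, 3.0])
mu = np.array([0.5, 1.0, 0.1, 2.0])
sigma = np.array([0.5, 0.001, 0.3, 1.5])
w = np.array([0.2, -0.4, 1.0, 0.3])
b = 0.1

m_explicit = margin_explicit(x, w, b, mu, sigma)
m_virtual = margin_virtual(x, w, b, mu, sigma)
print(m_explicit, m_virtual)
```

The offset can be computed once per set of coefficients and reused for every row, which is why the centering does not need to touch the data.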

However, I'm not sure whether it is necessary to backport the changes to old releases, because the old approach might still produce a "correct" model in the sense of making similar predictions, although the coefficients might converge slowly. I asked @zhengruifeng to test it. If it is not a correctness bug, we might save the effort of backporting the change.
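One possible way to frame such a test, sketched with synthetic data rather than the data from the JIRA: take two parameter sets that split weight differently between the intercept and a near-constant, uncentered column, and compare their predicted probabilities. All names and values here are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def max_prediction_gap(X, w1, b1, w2, b2):
    # Largest absolute difference in predicted probability over all rows.
    p1 = sigmoid(X @ w1 + b1)
    p2 = sigmoid(X @ w2 + b2)
    return np.max(np.abs(p1 - p2))

rng = np.random.default_rng(1)
n = 500
informative = rng.standard_normal(n)
# Near-constant column after std scaling without centering: values near 1000.
near_constant = (1.0 + 1e-3 * rng.standard_normal(n)) / 1e-3
X = np.column_stack([informative, near_constant])

# Model A keeps all of the constant effect in the intercept.
w_a, b_a = np.array([1.5, 0.0]), 0.7

# Model B moves part of it onto the near-constant feature's coefficient.
delta = 0.0005
w_b = np.array([1.5, delta])
b_b = 0.7 - delta * near_constant.mean()

gap = max_prediction_gap(X, w_a, b_a, w_b, b_b)
print("intercepts:", b_a, b_b)
print("max probability gap:", gap)
```

In this sketch the two intercepts differ noticeably while the predictions barely move, which is the kind of outcome that would suggest a convergence issue rather than a correctness bug.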
