You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Wayne Zhang (JIRA)" <ji...@apache.org> on 2016/12/04 20:01:00 UTC

[jira] [Updated] (SPARK-18701) Poisson GLM fails due to wrong initialization

     [ https://issues.apache.org/jira/browse/SPARK-18701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wayne Zhang updated SPARK-18701:
--------------------------------
      Shepherd: Sean Owen  (was: sean corkum)
    Issue Type: Bug  (was: New Feature)

> Poisson GLM fails due to wrong initialization
> ---------------------------------------------
>
>                 Key: SPARK-18701
>                 URL: https://issues.apache.org/jira/browse/SPARK-18701
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.0.2
>            Reporter: Wayne Zhang
>            Priority: Critical
>             Fix For: 2.2.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Poisson GLM fails for many standard data sets. The issue is incorrect initialization leading to almost zero probability and weights. The following simple example reproduces the error. 
> {code:borderStyle=solid}
> val datasetPoissonLogWithZero = Seq(
>       LabeledPoint(0.0, Vectors.dense(18, 1.0)),
>       LabeledPoint(1.0, Vectors.dense(12, 0.0)),
>       LabeledPoint(0.0, Vectors.dense(15, 0.0)),
>       LabeledPoint(0.0, Vectors.dense(13, 2.0)),
>       LabeledPoint(0.0, Vectors.dense(15, 1.0)),
>       LabeledPoint(1.0, Vectors.dense(16, 1.0)),
>       LabeledPoint(0.0, Vectors.dense(10, 0.0)),
>       LabeledPoint(0.0, Vectors.dense(15, 0.0)),
>       LabeledPoint(0.0, Vectors.dense(12, 2.0)),
>       LabeledPoint(0.0, Vectors.dense(13, 0.0)),
>       LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>       LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>       LabeledPoint(0.0, Vectors.dense(15, 0.0)),
>       LabeledPoint(0.0, Vectors.dense(12, 2.0)),
>       LabeledPoint(1.0, Vectors.dense(12, 2.0))
>     ).toDF()
>     
> val glr = new GeneralizedLinearRegression()
>   .setFamily("poisson")
>   .setLink("log")
>   .setMaxIter(20)
>   .setRegParam(0)
> val model = glr.fit(datasetPoissonLogWithZero)
> {code}
> The issue is in the initialization:  the mean is initialized as the response, which could be zero. Applying the log link results in very negative numbers (protected against -Inf), which again leads to close to zero probability and weights in the weighted least squares. The fix is easy: just add a small constant, highlighted in red below. 
>  
>     override def initialize(y: Double, weight: Double): Double = {
>       require(y >= 0.0, "The response variable of Poisson family " +
>         s"should be non-negative, but got $y")
>       y {color:red}+ 0.1 {color}
>     }
> I already have a fix and test code. Will create a PR. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org