Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/12/05 18:42:58 UTC

[jira] [Assigned] (SPARK-18715) Fix wrong AIC calculation in Binomial GLM

     [ https://issues.apache.org/jira/browse/SPARK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18715:
------------------------------------

    Assignee:     (was: Apache Spark)

> Fix wrong AIC calculation in Binomial GLM
> -----------------------------------------
>
>                 Key: SPARK-18715
>                 URL: https://issues.apache.org/jira/browse/SPARK-18715
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.0.2
>            Reporter: Wayne Zhang
>            Priority: Critical
>              Labels: patch
>             Fix For: 2.2.0
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The AIC calculation for the binomial GLM appears to be wrong when weights are present: the result differs from the one R produces.
> The current implementation is:
> {code}
>       -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
>         weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
>       }.sum()
> {code} 
> I suggest changing this to:
> {code}
>       -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
>         val wt = math.round(weight).toInt
>         if (wt == 0){
>           0.0
>         } else {
>           dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
>         }
>       }.sum()
> {code} 
> ----
> The following is an example to illustrate the problem.
> {code}
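> // Imports assumed when running outside spark-shell, with a SparkSession named `spark`
> import org.apache.spark.ml.feature.LabeledPoint
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.ml.regression.GeneralizedLinearRegression
> import org.apache.spark.sql.functions.col
> import spark.implicits._  // provides toDF() on a Seq of case classes
> val dataset = Seq(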
>       LabeledPoint(0.0, Vectors.dense(18, 1.0)),
>       LabeledPoint(0.5, Vectors.dense(12, 0.0)),
>       LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>       LabeledPoint(0.0, Vectors.dense(13, 2.0)),
>       LabeledPoint(0.0, Vectors.dense(15, 1.0)),
>       LabeledPoint(0.5, Vectors.dense(16, 1.0))
>     ).toDF().withColumn("weight", col("label") + 1.0)
> val glr = new GeneralizedLinearRegression()
>     .setFamily("binomial")
>     .setWeightCol("weight")
>     .setRegParam(0)
> val model = glr.fit(dataset)
> model.summary.aic
> {code}
> This calculation gives an AIC of 14.189026847171382. To verify whether this is correct, I ran the same analysis in R and got AIC = 11.66092 and -2 * LogLik = 5.660918.
> {code}
> da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",")
> 0,18,1,1
> 0.5,12,0,1.5
> 1,15,0,2
> 0,13,2,1
> 0,15,1,1
> 0.5,16,1,1.5
> da <- as.data.frame(da)
> f <- glm(y ~ x1 + x2, data = da, family = binomial(), weights = w)
> AIC(f)
> -2 * logLik(f)
> {code}
> Next, I check whether the proposed change is correct. The following computes -2 * LogLik manually and gets 5.6609177228379055, the same value as in R.
> {code}
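> // Assumed imports: Row for the pattern match, and the `dist` alias for Breeze distributions
> import org.apache.spark.sql.Row
> import breeze.stats.{distributions => dist}
> val predictions = model.transform(dataset)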
> -2.0 * predictions.select("label", "prediction", "weight").rdd.map {case Row(y: Double, mu: Double, weight: Double) =>
>       val wt = math.round(weight).toInt
>       if (wt == 0){
>         0.0
>       } else {
>         dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
>       }
>   }.sum()
> {code}
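
For readers without a Spark session at hand, the difference between the two formulas quoted above can be sketched in plain Scala. This is only an illustration under stated assumptions: the object name BinomialAicSketch and the helpers logChoose, logPmf, currentTerm and proposedTerm are hypothetical, the binomial log-probability is computed by hand instead of through Breeze, and the mu values in main are made up rather than fitted, so the printed numbers will not match the figures in the report.

```scala
object BinomialAicSketch {
  // log of the binomial coefficient C(n, k), as a sum of log ratios
  def logChoose(n: Int, k: Int): Double =
    (1 to k).map(i => math.log((n - k + i).toDouble) - math.log(i.toDouble)).sum

  // log P(K = k) for K ~ Binomial(n, p); assumes 0 < p < 1 and 0 <= k <= n
  def logPmf(n: Int, p: Double, k: Int): Double =
    logChoose(n, k) + k * math.log(p) + (n - k) * math.log1p(-p)

  // Per-observation term in the style of the current implementation:
  // a single Bernoulli trial on the rounded label, scaled by the weight
  def currentTerm(y: Double, mu: Double, weight: Double): Double =
    weight * logPmf(1, mu, math.round(y).toInt)

  // Per-observation term in the style of the proposed change:
  // round(weight) trials with round(y * weight) successes
  def proposedTerm(y: Double, mu: Double, weight: Double): Double = {
    val wt = math.round(weight).toInt
    if (wt == 0) 0.0 else logPmf(wt, mu, math.round(y * weight).toInt)
  }

  def main(args: Array[String]): Unit = {
    // (y, mu, weight); the mu values are illustrative, not fitted
    val rows = Seq((0.0, 0.2, 1.0), (0.5, 0.6, 1.5), (1.0, 0.7, 2.0))
    val current = -2.0 * rows.map { case (y, mu, w) => currentTerm(y, mu, w) }.sum
    val proposed = -2.0 * rows.map { case (y, mu, w) => proposedTerm(y, mu, w) }.sum
    println(s"current  -2 * logLik = $current")
    println(s"proposed -2 * logLik = $proposed")
  }
}
```

Note how a row with y = 0.5 and weight = 1.5 is handled: the current form rounds the label to 1 and scales one Bernoulli log-probability by 1.5, while the proposed form treats the row as 1 success out of 2 trials.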



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
