You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by actuaryzhang <gi...@git.apache.org> on 2017/01/25 07:12:40 UTC

[GitHub] spark pull request #16699: [SPARK-18710] Add offset in GLM

GitHub user actuaryzhang opened a pull request:

    https://github.com/apache/spark/pull/16699

    [SPARK-18710] Add offset in GLM

    ## What changes were proposed in this pull request?
    Add support for offset in GLM. This is useful for at least two reasons:
    
    1. Account for exposure: e.g., when modeling the number of accidents, we may need to use miles driven as an offset to access factors on frequency.    
    2. Test incremental effects of new variables: we can use predictions from the existing model as offset and run a much smaller model on only new variables. This avoids re-estimating the large model with all variables (old + new) and can be very important for efficient large-scaled analysis. 
    
    ## How was this patch tested?
    New test.
    
    @yanboliang @srowen @felixcheung @sethah 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/actuaryzhang/spark offset

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16699.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16699
    
----
commit 3bf2718c1a1e68273508e63499bb5d1cc8230155
Author: actuaryzhang <ac...@gmail.com>
Date:   2017-01-24T23:46:16Z

    add trait offset

commit 0e240eb313aa91cb645fb3ab8d70e51b6c65b3c7
Author: actuaryzhang <ac...@gmail.com>
Date:   2017-01-24T23:48:03Z

    add offset setter

commit 9c41453a19c0f9c31403fafaf1995c642c37c70d
Author: actuaryzhang <ac...@gmail.com>
Date:   2017-01-25T05:15:50Z

    implement offset in GLM

commit 7823f8af8b0926790816c9e79e9425e503e494ad
Author: actuaryzhang <ac...@gmail.com>
Date:   2017-01-25T06:55:56Z

    add test for glm with offset

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72003 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72003/testReport)** for PR 16699 at commit [`d44974c`](https://github.com/apache/spark/commit/d44974cfe50092bb639a31aa7aa9b16eb1d21fae).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72672 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72672/testReport)** for PR 16699 at commit [`fc64d32`](https://github.com/apache/spark/commit/fc64d32e72e1e406d6f7ab27563ea622aa7b1397).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #71995 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71995/testReport)** for PR 16699 at commit [`9eca1a6`](https://github.com/apache/spark/commit/9eca1a682d75e20df89ebdb0f1a01e02996f9c7f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124236849
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -440,13 +479,13 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
          * The reweight function used to update offsets and weights
    --- End diff --
    
    ```update offsets``` -> ```update working labels```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124403889
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -798,77 +798,184 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    -  test("glm summary: gaussian family with weight") {
    +  test("generalized linear regression with offset") {
         /*
    -       R code:
    +      R code:
    +      library(statmod)
     
    -       A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    -       b <- c(17, 19, 23, 29)
    -       w <- c(1, 2, 3, 4)
    -       df <- as.data.frame(cbind(A, b))
    -     */
    -    val datasetWithWeight = Seq(
    -      Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
    -      Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)),
    -      Instance(23.0, 3.0, Vectors.dense(2.0, 11.0)),
    -      Instance(29.0, 4.0, Vectors.dense(3.0, 13.0))
    +      df <- as.data.frame(matrix(c(
    +        0.2, 1.0, 2.0, 0.0, 5.0,
    +        0.5, 2.1, 0.5, 1.0, 2.0,
    +        0.9, 0.4, 1.0, 2.0, 1.0,
    +        0.7, 0.7, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- list(gaussian, binomial, poisson, Gamma, tweedie(1.5))
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +      [1]  0.5169222 -0.3344444
    +      [1]  0.9419107 -0.6864404
    +      [1]  0.1812436 -0.6568422
    +      [1] -0.2869094  0.7857710
    +      [1] 0.1055254 0.2979113
    +      [1] -0.05990345  0.53188982 -0.32118415
    +      [1] -0.2147117  0.9911750 -0.6356096
    +      [1] -1.5616130  0.6646470 -0.3192581
    +      [1]  0.3390397 -0.3406099  0.6870259
    +      [1] 0.3665034 0.1039416 0.1484616
    +    */
    +    val dataset = Seq(
    +      OffsetInstance(0.2, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(0.5, 2.1, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(0.9, 0.4, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(0.7, 0.7, 0.0, Vectors.dense(3.0, 3.0))
         ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0, 0.5169222, -0.3344444),
    +      Vectors.dense(0, 0.9419107, -0.6864404),
    +      Vectors.dense(0, 0.1812436, -0.6568422),
    +      Vectors.dense(0, -0.2869094, 0.785771),
    +      Vectors.dense(0, 0.1055254, 0.2979113),
    +      Vectors.dense(-0.05990345, 0.53188982, -0.32118415),
    +      Vectors.dense(-0.2147117, 0.991175, -0.6356096),
    +      Vectors.dense(-1.561613, 0.664647, -0.3192581),
    +      Vectors.dense(0.3390397, -0.3406099, 0.6870259),
    +      Vectors.dense(0.3665034, 0.1039416, 0.1484616))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "binomial", "poisson", "gamma", "tweedie")) {
    +        val trainer = new GeneralizedLinearRegression().setFamily(family)
    +          .setFitIntercept(fitIntercept).setOffsetCol("offset")
    +          .setWeightCol("weight").setLinkPredictionCol("linkPrediction")
    +        if (family == "tweedie") trainer.setVariancePower(1.5)
    +        val model = trainer.fit(dataset)
    +        val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
    +        assert(actual ~= expected(idx) absTol 1e-4, s"Model mismatch: GLM with family = $family," +
    +          s" and fitIntercept = $fitIntercept.")
    +
    +        val familyLink = FamilyAndLink(trainer)
    +        model.transform(dataset).select("features", "offset", "prediction", "linkPrediction")
    +          .collect().foreach {
    +          case Row(features: DenseVector, offset: Double, prediction1: Double,
    +          linkPrediction1: Double) =>
    +            val eta = BLAS.dot(features, model.coefficients) + model.intercept + offset
    +            val prediction2 = familyLink.fitted(eta)
    +            val linkPrediction2 = eta
    +            assert(prediction1 ~= prediction2 relTol 1E-5, "Prediction mismatch: GLM with " +
    +              s"family = $family, and fitIntercept = $fitIntercept.")
    +            assert(linkPrediction1 ~= linkPrediction2 relTol 1E-5, "Link Prediction mismatch: " +
    +              s"GLM with family = $family, and fitIntercept = $fitIntercept.")
    +        }
    +
    +        idx += 1
    +      }
    +    }
    +  }
    +
    +  test("generalized linear regression: predict with no offset") {
    +    val trainData = Seq(
    +      OffsetInstance(2.0, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(8.0, 2.0, 3.0, Vectors.dense(1.0, 7.0)),
    +      OffsetInstance(3.0, 3.0, 1.0, Vectors.dense(2.0, 11.0)),
    +      OffsetInstance(9.0, 4.0, 4.0, Vectors.dense(3.0, 13.0))
    +    ).toDF()
    +    val testData = trainData.select("weight", "features")
    +
    +    val trainer = new GeneralizedLinearRegression()
    +      .setFamily("poisson")
    +      .setWeightCol("weight")
    +      .setOffsetCol("offset")
    +      .setLinkPredictionCol("linkPrediction")
    +
    +    val model = trainer.fit(trainData)
    +    model.transform(testData).select("features", "linkPrediction")
    +      .collect().foreach {
    +      case Row(features: DenseVector, linkPrediction1: Double) =>
    +        val linkPrediction2 = BLAS.dot(features, model.coefficients) + model.intercept
    +        assert(linkPrediction1 ~= linkPrediction2 relTol 1E-5, "Link Prediction mismatch")
    +    }
    +  }
    +
    +  test("glm summary: gaussian family with weight and offset") {
         /*
    -       R code:
    +      R code:
     
    -       model <- glm(formula = "b ~ .", family="gaussian", data = df, weights = w)
    -       summary(model)
    +      A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    +      b <- c(17, 19, 23, 29)
    +      w <- c(1, 2, 3, 4)
    +      off <- c(2, 3, 1, 4)
    +      df <- as.data.frame(cbind(A, b))
    +     */
    +    val dataset = Seq(
    +      OffsetInstance(17.0, 1.0, 2.0, Vectors.dense(0.0, 5.0).toSparse),
    +      OffsetInstance(19.0, 2.0, 3.0, Vectors.dense(1.0, 7.0)),
    +      OffsetInstance(23.0, 3.0, 1.0, Vectors.dense(2.0, 11.0)),
    +      OffsetInstance(29.0, 4.0, 4.0, Vectors.dense(3.0, 13.0))
    +    ).toDF()
    +    /*
    +      R code:
     
    -       Deviance Residuals:
    -           1       2       3       4
    -       1.920  -1.358  -1.109   0.960
    +      model <- glm(formula = "b ~ .", family = "gaussian", data = df,
    +                   weights = w, offset = off)
    +      summary(model)
     
    -       Coefficients:
    -                   Estimate Std. Error t value Pr(>|t|)
    -       (Intercept)   18.080      9.608   1.882    0.311
    -       V1             6.080      5.556   1.094    0.471
    -       V2            -0.600      1.960  -0.306    0.811
    +      Deviance Residuals:
    +            1        2        3        4
    +       0.9600  -0.6788  -0.5543   0.4800
     
    -       (Dispersion parameter for gaussian family taken to be 7.68)
    +      Coefficients:
    +                  Estimate Std. Error t value Pr(>|t|)
    +      (Intercept)   5.5400     4.8040   1.153    0.455
    +      V1           -0.9600     2.7782  -0.346    0.788
    +      V2            1.7000     0.9798   1.735    0.333
     
    -           Null deviance: 202.00  on 3  degrees of freedom
    -       Residual deviance:   7.68  on 1  degrees of freedom
    -       AIC: 18.783
    +      (Dispersion parameter for gaussian family taken to be 1.92)
     
    -       Number of Fisher Scoring iterations: 2
    +          Null deviance: 152.10  on 3  degrees of freedom
    +      Residual deviance:   1.92  on 1  degrees of freedom
    +      AIC: 13.238
     
    -       residuals(model, type="pearson")
    -              1         2         3         4
    -       1.920000 -1.357645 -1.108513  0.960000
    +      Number of Fisher Scoring iterations: 2
     
    -       residuals(model, type="working")
    +      residuals(model, type = "pearson")
    +               1          2          3          4
    +      0.9600000 -0.6788225 -0.5542563  0.4800000
    +      residuals(model, type = "working")
               1     2     3     4
    -       1.92 -0.96 -0.64  0.48
    -
    -       residuals(model, type="response")
    +      0.96 -0.48 -0.32  0.24
    --- End diff --
    
    They don't have the same indentation with the old code but within the new block of code, every line is still aligned. Won't change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124212354
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala ---
    @@ -27,3 +27,28 @@ import org.apache.spark.ml.linalg.Vector
      * @param features The vector of features for this data point.
      */
     private[ml] case class Instance(label: Double, weight: Double, features: Vector)
    +
    +/**
    + * Case class that represents an instance of data point with
    + * label, weight, offset and features.
    --- End diff --
    
    Add doc ```This is mainly used in GeneralizedLinearRegression currently```.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    this is a known issue in test runs currently - it's mentioned in dev@spark.apache.org, just so you know.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #78812 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78812/testReport)** for PR 16699 at commit [`1e47a11`](https://github.com/apache/spark/commit/1e47a11f6a3e1292426ea25e7728b805c6650e1b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @sethah Yes, that is lots of work. However, the only critical change (since the last commit) is on the calculation of the null deviance. The other changes are mainly because of updating existing tests as you suggested (which makes sense and indeed helps to capture a few issues). This requires adding new offsets and updating all inference results... I think it makes sense to just focus on the null deviance part.   
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/16699


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72068/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98300999
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -743,6 +743,84 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    +  test("generalized linear regression with offset") {
    +    /*
    +      R code:
    +      library(statmod)
    +      df <- as.data.frame(matrix(c(
    +        1.0, 1.0, 2.0, 0.0, 5.0,
    +        2.0, 2.0, 0.5, 1.0, 2.0,
    +        1.0, 3.0, 1.0, 2.0, 1.0,
    +        2.0, 4.0, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- list(gaussian, poisson, Gamma, tweedie(1.5))
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +
    +      [1] 0.535040431 0.005390836
    +      [1]  0.1968355 -0.2061711
    +      [1]  0.307996 -0.153579
    +      [1]  0.32166185 -0.09698986
    +      [1] -0.8800000  0.7342857  0.1714286
    +      [1] -1.9991044  0.7247511  0.1424392
    +      [1] -0.27378146  0.31599396 -0.06204946
    +      [1] -0.17118812  0.31200361 -0.02541656
    +    */
    +    val dataset = Seq(
    +      OffsetInstance(1.0, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(2.0, 2.0, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(1.0, 3.0, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(2.0, 4.0, 0.0, Vectors.dense(3.0, 3.0))
    +    ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0.0, 0.535040431, 0.005390836),
    +      Vectors.dense(0.0, 0.1968355, -0.2061711),
    +      Vectors.dense(0.0, 0.307996, -0.153579),
    +      Vectors.dense(0.0, 0.32166185, -0.09698986),
    +      Vectors.dense(-0.88, 0.7342857, 0.1714286),
    +      Vectors.dense(-1.9991044, 0.7247511, 0.1424392),
    +      Vectors.dense(-0.27378146, 0.31599396, -0.06204946),
    +      Vectors.dense(-0.17118812, 0.31200361, -0.02541656))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "poisson", "gamma", "tweedie")) {
    +        var trainer = new GeneralizedLinearRegression().setFamily(family)
    --- End diff --
    
    just a suggestion: maybe refactor the code below in a method and do fitIntercept.map(fi => family.map(fm => (fi, fm)).zip(expected).foreach(params => callMyMethod(params._1, params._2, params._3))
    then you would get rid of the for loops and not have one long test case and you could remove the idx += 1 below


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98170182
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -336,14 +361,15 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
         }
     
         val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    -    val instances: RDD[Instance] =
    -      dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
    -        case Row(label: Double, weight: Double, features: Vector) =>
    -          Instance(label, weight, features)
    -      }
    +    val off = if (!isDefined(offsetCol) || $(offsetCol).isEmpty) lit(0.0) else col($(offsetCol))
    --- End diff --
    
    BTW, the new params like `offsetCol` should be added into `Instrumentation.logParams` above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @yanboliang @actuaryzhang this PR breaks the scala-2.10 build:
    ```
    [error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala:1117: value absTol is not a member of Int
    [error]     assert(summary.dispersion ~== dispersionR absTol 1E-3)
    [error]                                               ^
    [error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala:1232: value absTol is not a member of Int
    ```
    See: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-scala-2.10/4721/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71975/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @sethah Thanks much for your review. 
    
    Regarding prediction, both R and my implementation here allow prediction with offsets. If the users want to get the predicted rates (instead of counts), then they can specify the offset to be `lit(0.0)` in the new data set when making predictions. 
    
    The following shows that R prediction includes offsets:
    ```
    A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    b <- c(1, 0, 0.2, 2)
    off <- c(2, 3, 5, 4)
    df <- as.data.frame(cbind(A, b))
    model <- glm(formula = "b ~ .", family = "poisson", data = df, 
                 offset = off)
    p1 <- predict(model, df, type = "response")
    p2 <- as.numeric(exp(cbind(1, A) %*% coef(model) + off))
    sum(p1 - p2)
    ```
    
    This part of the code in the GeneralizedLinearRegression file shows that we do need offset when making the prediction. 
    ```
      protected def predict(features: Vector, offset: Double): Double = {
        val eta = predictLink(features, offset)
        familyAndLink.fitted(eta)
      }
    ```
    
    And this is the test for the prediction:
    ``` 
    val familyLink = FamilyAndLink(trainer)
            model.transform(dataset).select("features", "offset", "prediction", "linkPrediction")
              .collect().foreach {
              case Row(features: DenseVector, offset: Double, prediction1: Double,
              linkPrediction1: Double) =>
                val eta = BLAS.dot(features, model.coefficients) + model.intercept + offset
                val prediction2 = familyLink.fitted(eta)
                val linkPrediction2 = eta
                assert(prediction1 ~= prediction2 relTol 1E-5, "Prediction mismatch: GLM with " +
                  s"family = $family, and fitIntercept = $fitIntercept.")
                assert(linkPrediction1 ~= linkPrediction2 relTol 1E-5, "Link Prediction mismatch: " +
                  s"GLM with family = $family, and fitIntercept = $fitIntercept.")
            }
    ```
    
    Does this make sense?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100976816
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala ---
    @@ -27,3 +27,25 @@ import org.apache.spark.ml.linalg.Vector
      * @param features The vector of features for this data point.
      */
     private[ml] case class Instance(label: Double, weight: Double, features: Vector)
    +
    +/**
    + * Case class that represents an instance of data point with
    + * label, weight, offset and features.
    + *
    + * @param label Label for this data point.
    + * @param weight The weight of this instance.
    + * @param offset The offset used for this data point.
    + * @param features The vector of features for this data point.
    + */
    +private[ml] case class OffsetInstance(label: Double, weight: Double, offset: Double,
    --- End diff --
    
    Updated the style. 
    I think I'll stick with OffsetInstance. There may be other models in the future that use offset. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124737247
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -303,6 +327,16 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
       def setWeightCol(value: String): this.type = set(weightCol, value)
     
       /**
    +   * Sets the value of param [[offsetCol]].
    +   * If this is not set or empty, we treat all instance offsets as 0.0.
    +   * Default is not set, so all instances have offset 0.0.
    +   *
    +   * @group setParam
    +   */
    +  @Since("2.2.0")
    --- End diff --
    
    2.2.0 -> 2.3.0


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124268761
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -798,77 +798,160 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    -  test("glm summary: gaussian family with weight") {
    +  test("generalized linear regression with offset") {
         /*
    -       R code:
    +      R code:
    +      library(statmod)
     
    -       A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    -       b <- c(17, 19, 23, 29)
    -       w <- c(1, 2, 3, 4)
    -       df <- as.data.frame(cbind(A, b))
    -     */
    -    val datasetWithWeight = Seq(
    -      Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
    -      Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)),
    -      Instance(23.0, 3.0, Vectors.dense(2.0, 11.0)),
    -      Instance(29.0, 4.0, Vectors.dense(3.0, 13.0))
    +      df <- as.data.frame(matrix(c(
    +        0.2, 1.0, 2.0, 0.0, 5.0,
    +        0.5, 2.1, 0.5, 1.0, 2.0,
    +        0.9, 0.4, 1.0, 2.0, 1.0,
    +        0.7, 0.7, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- list(gaussian, binomial, poisson, Gamma, tweedie(1.5))
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +      [1]  0.5169222 -0.3344444
    +      [1]  0.9419107 -0.6864404
    +      [1]  0.1812436 -0.6568422
    +      [1] -0.2869094  0.7857710
    +      [1] 0.1055254 0.2979113
    +      [1] -0.05990345  0.53188982 -0.32118415
    +      [1] -0.2147117  0.9911750 -0.6356096
    +      [1] -1.5616130  0.6646470 -0.3192581
    +      [1]  0.3390397 -0.3406099  0.6870259
    +      [1] 0.3665034 0.1039416 0.1484616
    +    */
    +    val dataset = Seq(
    +      OffsetInstance(0.2, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(0.5, 2.1, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(0.9, 0.4, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(0.7, 0.7, 0.0, Vectors.dense(3.0, 3.0))
         ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0, 0.5169222, -0.3344444),
    +      Vectors.dense(0, 0.9419107, -0.6864404),
    +      Vectors.dense(0, 0.1812436, -0.6568422),
    +      Vectors.dense(0, -0.2869094, 0.785771),
    +      Vectors.dense(0, 0.1055254, 0.2979113),
    +      Vectors.dense(-0.05990345, 0.53188982, -0.32118415),
    +      Vectors.dense(-0.2147117, 0.991175, -0.6356096),
    +      Vectors.dense(-1.561613, 0.664647, -0.3192581),
    +      Vectors.dense(0.3390397, -0.3406099, 0.6870259),
    +      Vectors.dense(0.3665034, 0.1039416, 0.1484616))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "binomial", "poisson", "gamma", "tweedie")) {
    --- End diff --
    
    +1 @sethah We can leave it as a follow-up work.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71994/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98496017
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -336,14 +361,19 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
         }
     
         val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    -    val instances: RDD[Instance] =
    -      dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
    -        case Row(label: Double, weight: Double, features: Vector) =>
    -          Instance(label, weight, features)
    -      }
    +    val off = if (!isDefined(offsetCol) || $(offsetCol).isEmpty) {
    --- End diff --
    
    Could you point out which lines have the spacing issue? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #71995 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71995/testReport)** for PR 16699 at commit [`9eca1a6`](https://github.com/apache/spark/commit/9eca1a682d75e20df89ebdb0f1a01e02996f9c7f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72068 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72068/testReport)** for PR 16699 at commit [`da4174a`](https://github.com/apache/spark/commit/da4174a6c2639001828e587794e1a32dcf8db15d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100974891
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -798,77 +798,160 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    -  test("glm summary: gaussian family with weight") {
    +  test("generalized linear regression with offset") {
         /*
    -       R code:
    +      R code:
    +      library(statmod)
     
    -       A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    -       b <- c(17, 19, 23, 29)
    -       w <- c(1, 2, 3, 4)
    -       df <- as.data.frame(cbind(A, b))
    -     */
    -    val datasetWithWeight = Seq(
    -      Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
    -      Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)),
    -      Instance(23.0, 3.0, Vectors.dense(2.0, 11.0)),
    -      Instance(29.0, 4.0, Vectors.dense(3.0, 13.0))
    +      df <- as.data.frame(matrix(c(
    +        0.2, 1.0, 2.0, 0.0, 5.0,
    +        0.5, 2.1, 0.5, 1.0, 2.0,
    +        0.9, 0.4, 1.0, 2.0, 1.0,
    +        0.7, 0.7, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- list(gaussian, binomial, poisson, Gamma, tweedie(1.5))
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +      [1]  0.5169222 -0.3344444
    +      [1]  0.9419107 -0.6864404
    +      [1]  0.1812436 -0.6568422
    +      [1] -0.2869094  0.7857710
    +      [1] 0.1055254 0.2979113
    +      [1] -0.05990345  0.53188982 -0.32118415
    +      [1] -0.2147117  0.9911750 -0.6356096
    +      [1] -1.5616130  0.6646470 -0.3192581
    +      [1]  0.3390397 -0.3406099  0.6870259
    +      [1] 0.3665034 0.1039416 0.1484616
    +    */
    +    val dataset = Seq(
    +      OffsetInstance(0.2, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(0.5, 2.1, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(0.9, 0.4, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(0.7, 0.7, 0.0, Vectors.dense(3.0, 3.0))
         ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0, 0.5169222, -0.3344444),
    +      Vectors.dense(0, 0.9419107, -0.6864404),
    +      Vectors.dense(0, 0.1812436, -0.6568422),
    +      Vectors.dense(0, -0.2869094, 0.785771),
    +      Vectors.dense(0, 0.1055254, 0.2979113),
    +      Vectors.dense(-0.05990345, 0.53188982, -0.32118415),
    +      Vectors.dense(-0.2147117, 0.991175, -0.6356096),
    +      Vectors.dense(-1.561613, 0.664647, -0.3192581),
    +      Vectors.dense(0.3390397, -0.3406099, 0.6870259),
    +      Vectors.dense(0.3665034, 0.1039416, 0.1484616))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "binomial", "poisson", "gamma", "tweedie")) {
    --- End diff --
    
    I did implement this, but it seems that the order of the values in `GeneralizedLinearRegression.supportedFamilyNames` changes from test to test... I'm not sure why this happened but since it's a minor issues, I just reverted it back.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98018738
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -578,6 +578,79 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    +  test("generalized linear regression with offset") {
    +    /*
    +      R code:
    +      df <- as.data.frame(matrix(c(
    +        1.0, 1.0, 2.0, 0.0, 5.0,
    +        2.0, 2.0, 0.5, 1.0, 2.0,
    +        1.0, 3.0, 1.0, 2.0, 1.0,
    +        2.0, 4.0, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- c(gaussian, poisson, Gamma)
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +
    +      [1] 0.535040431 0.005390836
    +      [1]  0.1968355 -0.2061711
    +      [1]  0.307996 -0.153579
    +      [1] -0.8800000  0.7342857  0.1714286
    +      [1] -1.9991044  0.7247511  0.1424392
    +      [1] -0.27378146  0.31599396 -0.06204946
    +     */
    +    val dataset = Seq(
    +      OffsetInstance(1.0, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(2.0, 2.0, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(1.0, 3.0, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(2.0, 4.0, 0.0, Vectors.dense(3.0, 3.0))
    +    ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0.0, 0.535040431, 0.005390836),
    +      Vectors.dense(0.0, 0.1968355, -0.2061711),
    +      Vectors.dense(0.0, 0.307996, -0.153579),
    +      Vectors.dense(-0.88, 0.7342857, 0.1714286),
    +      Vectors.dense(-1.9991044, 0.7247511, 0.1424392),
    +      Vectors.dense(-0.27378146, 0.31599396, -0.06204946))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "poisson", "gamma")) {
    +      val trainer = new GeneralizedLinearRegression().setFamily(family)
    +        .setFitIntercept(fitIntercept).setOffsetCol("offset")
    +        .setWeightCol("weight").setLinkPredictionCol("linkPrediction")
    --- End diff --
    
    in my code reviews @jkbradley noted that unnecessary parameters should be removed.  It looks like most of these are already defaults and can be removed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124271108
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -578,6 +578,79 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    +  test("generalized linear regression with offset") {
    +    /*
    +      R code:
    +      df <- as.data.frame(matrix(c(
    +        1.0, 1.0, 2.0, 0.0, 5.0,
    +        2.0, 2.0, 0.5, 1.0, 2.0,
    +        1.0, 3.0, 1.0, 2.0, 1.0,
    +        2.0, 4.0, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- c(gaussian, poisson, Gamma)
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +
    +      [1] 0.535040431 0.005390836
    +      [1]  0.1968355 -0.2061711
    +      [1]  0.307996 -0.153579
    +      [1] -0.8800000  0.7342857  0.1714286
    +      [1] -1.9991044  0.7247511  0.1424392
    +      [1] -0.27378146  0.31599396 -0.06204946
    +     */
    +    val dataset = Seq(
    +      OffsetInstance(1.0, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(2.0, 2.0, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(1.0, 3.0, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(2.0, 4.0, 0.0, Vectors.dense(3.0, 3.0))
    +    ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0.0, 0.535040431, 0.005390836),
    +      Vectors.dense(0.0, 0.1968355, -0.2061711),
    +      Vectors.dense(0.0, 0.307996, -0.153579),
    +      Vectors.dense(-0.88, 0.7342857, 0.1714286),
    +      Vectors.dense(-1.9991044, 0.7247511, 0.1424392),
    +      Vectors.dense(-0.27378146, 0.31599396, -0.06204946))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "poisson", "gamma")) {
    +      val trainer = new GeneralizedLinearRegression().setFamily(family)
    +        .setFitIntercept(fitIntercept).setOffsetCol("offset")
    +        .setWeightCol("weight").setLinkPredictionCol("linkPrediction")
    +      val model = trainer.fit(dataset)
    +      val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
    +      assert(actual ~= expected(idx) absTol 1e-4, s"Model mismatch: GLM with family = $family," +
    +        s" and fitIntercept = $fitIntercept.")
    +
    +      val familyObj = Family.fromName(family)
    +      val familyLink = new FamilyAndLink(familyObj, familyObj.defaultLink)
    +      model.transform(dataset).select("features", "offset", "prediction", "linkPrediction")
    +        .collect().foreach {
    +          case Row(features: DenseVector, offset: Double, prediction1: Double,
    +          linkPrediction1: Double) =>
    +            val eta = BLAS.dot(features, model.coefficients) + model.intercept + offset
    +            val prediction2 = familyLink.fitted(eta)
    +            val linkPrediction2 = eta
    +            assert(prediction1 ~= prediction2 relTol 1E-5, "Prediction mismatch: GLM with " +
    --- End diff --
    
    Here we use ```~=``` intentionally, since we need to handle exception information by ourselves to make it more clear by printing out family name and whether to fit with intercept.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71974/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100901912
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -944,15 +981,27 @@ class GeneralizedLinearRegressionModel private[ml] (
       private lazy val familyAndLink = FamilyAndLink(this)
     
       override protected def predict(features: Vector): Double = {
    -    val eta = predictLink(features)
    +    if (!isSetOffsetCol(this)) {
    --- End diff --
    
    Can we just make this `override protected def predict(features: Vector): Double = predict(features, 0.0)` ?
    
    This predict function is only potentially used by users who extended GLR before this patch, when offset was always 0.0 anyway.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124737056
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -134,6 +134,25 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
       @Since("2.0.0")
       def getLinkPredictionCol: String = $(linkPredictionCol)
     
    +  /**
    +   * Param for offset column name. If this is not set or empty, we treat all instance offsets
    +   * as 0.0. The feature specified as offset has a constant coefficient of 1.0.
    +   * @group param
    +   */
    +  final val offsetCol: Param[String] = new Param[String](this, "offsetCol", "The offset " +
    +    "column name. If this is not set or empty, we treat all instance offsets as 0.0")
    +
    +  /** @group getParam */
    +  def getOffsetCol: String = $(offsetCol)
    --- End diff --
    
    ```@Since("2.3.0")```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Made a new commit that fixes the issues you pointed out. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100896289
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -168,6 +179,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
         }
     
         val newSchema = super.validateAndTransformSchema(schema, fitting, featuresDataType)
    +    if (isSetOffsetCol(this)) SchemaUtils.checkNumericType(schema, $(offsetCol))
    --- End diff --
    
    Are we requiring users to provide an offset when predicting on new data? I don't think we should, i.e. I think we should assume offset is zero if no offset column is provided. More specifically, I think this test should pass:
    
    ````scala
    test("predict with no offset") {
        val trainData = Seq(
          OffsetInstance(2.0, 1.0, 2.0, Vectors.dense(0.0, 5.0).toSparse),
          OffsetInstance(8.0, 2.0, 3.0, Vectors.dense(1.0, 7.0)),
          OffsetInstance(3.0, 3.0, 1.0, Vectors.dense(2.0, 11.0)),
          OffsetInstance(9.0, 4.0, 4.0, Vectors.dense(3.0, 13.0))
        ).toDF()
        val testData = Seq(
          Instance(2.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
          Instance(8.0, 2.0, Vectors.dense(1.0, 7.0)),
          Instance(3.0, 3.0, Vectors.dense(2.0, 11.0)),
          Instance(9.0, 4.0, Vectors.dense(3.0, 13.0))
        ).toDF()
        val trainer = new GeneralizedLinearRegression()
          .setFamily("poisson")
          .setWeightCol("weight")
          .setOffsetCol("offset")
    
        val model = trainer.fit(trainData)
        model.transform(testData).show()
      }
    ````


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98016201
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -263,17 +288,21 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
         }
     
         val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    -    val instances: RDD[Instance] =
    -      dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
    -        case Row(label: Double, weight: Double, features: Vector) =>
    -          Instance(label, weight, features)
    +    val off = if (!isDefined(offsetCol) || $(offsetCol).isEmpty) lit(0.0) else col($(offsetCol))
    +    val instances: RDD[OffsetInstance] =
    +      dataset.select(col($(labelCol)), w, off, col($(featuresCol))).rdd.map {
    +        case Row(label: Double, weight: Double, offset: Double, features: Vector) =>
    +          OffsetInstance(label, weight, offset, features)
           }
     
         val model = if (familyObj == Gaussian && linkObj == Identity) {
           // TODO: Make standardizeFeatures and standardizeLabel configurable.
    +      val wlsInstances: RDD[Instance] = instances.map { instance =>
    +        Instance(instance.label - instance.offset, instance.weight, instance.features)
    --- End diff --
    
    the more I think about this code, it looks like moving the initialization of instances inside the if/else below and creating Instance in one case and OffsetInstance in the other would save some memory/time and wouldn't force you to mess with the Instance case class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124269044
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -798,77 +798,184 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    -  test("glm summary: gaussian family with weight") {
    +  test("generalized linear regression with offset") {
    --- End diff --
    
    ```with offset with weight```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100581520
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -1218,16 +1266,35 @@ class GeneralizedLinearRegressionSummary private[regression] (
        */
       @Since("2.0.0")
       lazy val nullDeviance: Double = {
    -    val w = weightCol
    -    val wtdmu: Double = if (model.getFitIntercept) {
    -      val agg = predictions.agg(sum(w.multiply(col(model.getLabelCol))), sum(w)).first()
    -      agg.getDouble(0) / agg.getDouble(1)
    +    val intercept: Double = if (!model.getFitIntercept) {
    +      0.0
         } else {
    -      link.unlink(0.0)
    +      /*
    +        Estimate intercept analytically when there is no offset, or when there is offset but
    +        the model is Gaussian family with identity link. Otherwise, fit an intercept only model.
    +       */
    +      if (!isSetOffsetCol(model) ||
    +        (isSetOffsetCol(model) && family == Gaussian && link == Identity)) {
    +        val agg = predictions.agg(sum(weight.multiply(
    +          label.minus(offset))), sum(weight)).first()
    +        link.link(agg.getDouble(0) / agg.getDouble(1))
    +      } else {
    +        // Create empty feature column and fit intercept only model using param setting from model
    +        val featureNull = "feature_" + java.util.UUID.randomUUID.toString
    +        val paramMap = model.extractParamMap()
    +        paramMap.put(model.featuresCol, featureNull)
    +        if (family.name != "tweedie") {
    +          paramMap.remove(model.variancePower)
    +        }
    +        val emptyVectorUDF = udf{ () => Vectors.zeros(0) }
    +        model.parent.fit(
    +          dataset.withColumn(featureNull, emptyVectorUDF()), paramMap
    +        ).intercept
    +      }
         }
    -    predictions.select(col(model.getLabelCol).cast(DoubleType), w).rdd.map {
    -      case Row(y: Double, weight: Double) =>
    -        family.deviance(y, wtdmu, weight)
    +    predictions.select(label, offset, weight).rdd.map {
    +      case Row(y: Double, offset: Double, weight: Double) =>
    +        family.deviance(y, link.unlink(intercept + offset), weight)
         }.sum()
       }
    --- End diff --
    
    @sethah This part is the most critical change since last time. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98020407
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -578,6 +578,79 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    +  test("generalized linear regression with offset") {
    +    /*
    +      R code:
    +      df <- as.data.frame(matrix(c(
    +        1.0, 1.0, 2.0, 0.0, 5.0,
    +        2.0, 2.0, 0.5, 1.0, 2.0,
    +        1.0, 3.0, 1.0, 2.0, 1.0,
    +        2.0, 4.0, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- c(gaussian, poisson, Gamma)
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +
    +      [1] 0.535040431 0.005390836
    +      [1]  0.1968355 -0.2061711
    +      [1]  0.307996 -0.153579
    +      [1] -0.8800000  0.7342857  0.1714286
    +      [1] -1.9991044  0.7247511  0.1424392
    +      [1] -0.27378146  0.31599396 -0.06204946
    +     */
    +    val dataset = Seq(
    +      OffsetInstance(1.0, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(2.0, 2.0, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(1.0, 3.0, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(2.0, 4.0, 0.0, Vectors.dense(3.0, 3.0))
    +    ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0.0, 0.535040431, 0.005390836),
    +      Vectors.dense(0.0, 0.1968355, -0.2061711),
    +      Vectors.dense(0.0, 0.307996, -0.153579),
    +      Vectors.dense(-0.88, 0.7342857, 0.1714286),
    +      Vectors.dense(-1.9991044, 0.7247511, 0.1424392),
    +      Vectors.dense(-0.27378146, 0.31599396, -0.06204946))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "poisson", "gamma")) {
    +      val trainer = new GeneralizedLinearRegression().setFamily(family)
    +        .setFitIntercept(fitIntercept).setOffsetCol("offset")
    +        .setWeightCol("weight").setLinkPredictionCol("linkPrediction")
    +      val model = trainer.fit(dataset)
    +      val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
    +      assert(actual ~= expected(idx) absTol 1e-4, s"Model mismatch: GLM with family = $family," +
    +        s" and fitIntercept = $fitIntercept.")
    +
    +      val familyObj = Family.fromName(family)
    +      val familyLink = new FamilyAndLink(familyObj, familyObj.defaultLink)
    +      model.transform(dataset).select("features", "offset", "prediction", "linkPrediction")
    +        .collect().foreach {
    +          case Row(features: DenseVector, offset: Double, prediction1: Double,
    +          linkPrediction1: Double) =>
    +            val eta = BLAS.dot(features, model.coefficients) + model.intercept + offset
    +            val prediction2 = familyLink.fitted(eta)
    +            val linkPrediction2 = eta
    +            assert(prediction1 ~= prediction2 relTol 1E-5, "Prediction mismatch: GLM with " +
    --- End diff --
    
    I believe we are encouraged to use the ~== (triple signs)
    At least, that is what I had to use in my PR for RandomForestSuite.scala


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98299104
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -336,14 +361,15 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
         }
     
         val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    -    val instances: RDD[Instance] =
    -      dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
    -        case Row(label: Double, weight: Double, features: Vector) =>
    -          Instance(label, weight, features)
    -      }
    +    val off = if (!isDefined(offsetCol) || $(offsetCol).isEmpty) lit(0.0) else col($(offsetCol))
    --- End diff --
    
    +1 on adding it to logParams, good catch


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72040 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72040/testReport)** for PR 16699 at commit [`e183c08`](https://github.com/apache/spark/commit/e183c08373733249d7b9c476875c2a4a2dff3c05).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @yanboliang @zhengruifeng @srowen 
    Could you guys take a look and let me know if there is any changes needed? Thanks much!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98077231
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala ---
    @@ -27,3 +27,25 @@ import org.apache.spark.ml.linalg.Vector
      * @param features The vector of features for this data point.
      */
     private[ml] case class Instance(label: Double, weight: Double, features: Vector)
    +
    +/**
    + * Case class that represents an instance of data point with
    + * label, weight, offset and features.
    + *
    + * @param label Label for this data point.
    + * @param weight The weight of this instance.
    + * @param offset The offset used for this data point.
    + * @param features The vector of features for this data point.
    + */
    +private[ml] case class OffsetInstance(label: Double, weight: Double, offset: Double,
    +                                   features: Vector) {
    +
    +  /** Constructs from an [[Instance]] object and offset */
    +  def this(instance: Instance, offset: Double = 0.0) = {
    +    this(instance.label, instance.weight, offset, instance.features)
    +  }
    +
    +  /** Converts to an [[Instance]] object by leaving out the offset. */
    +  private[ml] def toInstance: Instance = Instance(label, weight, features)
    --- End diff --
    
    Yes, this is only used once in the current code, and I can get rid of it. But I feel other regression-type models may use offset at some point and having this method will make it easier to switch between Instance and OffsetInstance.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72067 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72067/testReport)** for PR 16699 at commit [`58f93af`](https://github.com/apache/spark/commit/58f93af6236d52f87c82411a645cf15413b30b9e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72858/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @hvanhovell I will send a quick fix soon, thanks for your kindly remind.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98580805
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -336,14 +361,19 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
         }
     
         val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    -    val instances: RDD[Instance] =
    -      dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
    -        case Row(label: Double, weight: Double, features: Vector) =>
    -          Instance(label, weight, features)
    -      }
    +    val offset = if (!isDefined(offsetCol) || $(offsetCol).isEmpty) {
    +      lit(0.0)
    +    } else {
    +      col($(offsetCol)).cast(DoubleType)
    +    }
    --- End diff --
    
    I'm following the style in existing code, for example: 
    ```
      lazy val rank: Long = if (model.getFitIntercept) {
        model.coefficients.size + 1
      } else {
        model.coefficients.size
      }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72067/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100923075
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala ---
    @@ -27,3 +27,25 @@ import org.apache.spark.ml.linalg.Vector
      * @param features The vector of features for this data point.
      */
     private[ml] case class Instance(label: Double, weight: Double, features: Vector)
    +
    +/**
    + * Case class that represents an instance of data point with
    + * label, weight, offset and features.
    + *
    + * @param label Label for this data point.
    + * @param weight The weight of this instance.
    + * @param offset The offset used for this data point.
    + * @param features The vector of features for this data point.
    + */
    +private[ml] case class OffsetInstance(label: Double, weight: Double, offset: Double,
    --- End diff --
    
    Also, we should be using Spark style indentation:
    
    ````scala
    private[ml] case class OffsetInstance(
        label: Double,
        weight: Double,
        offset: Double,
        features: Vector) {
    ````


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r97876098
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -736,15 +762,27 @@ class GeneralizedLinearRegressionModel private[ml] (
       private lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj)
     
       override protected def predict(features: Vector): Double = {
    -    val eta = predictLink(features)
    +    if (!isSet(offsetCol) || $(offsetCol).isEmpty) {
    +      val eta = BLAS.dot(features, coefficients) + intercept
    +      familyAndLink.fitted(eta)
    +    } else {
    +      throw new SparkException("Must supply offset value when offset is set.")
    --- End diff --
    
    maybe a better error message might be:
    Must supply offset to predict when offset column is specified


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124739973
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -944,15 +984,22 @@ class GeneralizedLinearRegressionModel private[ml] (
       private lazy val familyAndLink = FamilyAndLink(this)
     
       override protected def predict(features: Vector): Double = {
    -    val eta = predictLink(features)
    +    predict(features, 0.0)
    +  }
    +
    +  /**
    +   * Calculates the predicted value when offset is set.
    +   */
    +  def predict(features: Vector, offset: Double): Double = {
    --- End diff --
    
    Mark it private, since we don't support predict on single instance for all models currently.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @sethah @imatiach-msft 
    
    Please review the new commit. Main changes:
    - Fix issue in null deviance calculation in the presence of offset. Except for special cases (Gaussian with identity link), we need to re-estimate a intercept only model. The discussions in #16740 are indeed very helpful here. 
    - Updated the test for weighted GLM to include offset as well in order to test the addition of offset thoroughly. 
    
    Please let me know if there is any issue. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72095 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72095/testReport)** for PR 16699 at commit [`52bc32b`](https://github.com/apache/spark/commit/52bc32b2d86b2cd5ce092f86ee61f8fe9aebec5d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @actuaryzhang  added a couple comments, please take a look, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72858/testReport)** for PR 16699 at commit [`e95c25b`](https://github.com/apache/spark/commit/e95c25b73682669b65f194141ae08c56deb4d90c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r99263111
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -743,6 +743,84 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    +  test("generalized linear regression with offset") {
    +    /*
    +      R code:
    +      library(statmod)
    +      df <- as.data.frame(matrix(c(
    +        1.0, 1.0, 2.0, 0.0, 5.0,
    +        2.0, 2.0, 0.5, 1.0, 2.0,
    +        1.0, 3.0, 1.0, 2.0, 1.0,
    +        2.0, 4.0, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- list(gaussian, poisson, Gamma, tweedie(1.5))
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +
    +      [1] 0.535040431 0.005390836
    +      [1]  0.1968355 -0.2061711
    +      [1]  0.307996 -0.153579
    +      [1]  0.32166185 -0.09698986
    +      [1] -0.8800000  0.7342857  0.1714286
    +      [1] -1.9991044  0.7247511  0.1424392
    +      [1] -0.27378146  0.31599396 -0.06204946
    +      [1] -0.17118812  0.31200361 -0.02541656
    +    */
    +    val dataset = Seq(
    +      OffsetInstance(1.0, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(2.0, 2.0, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(1.0, 3.0, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(2.0, 4.0, 0.0, Vectors.dense(3.0, 3.0))
    +    ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0.0, 0.535040431, 0.005390836),
    +      Vectors.dense(0.0, 0.1968355, -0.2061711),
    +      Vectors.dense(0.0, 0.307996, -0.153579),
    +      Vectors.dense(0.0, 0.32166185, -0.09698986),
    +      Vectors.dense(-0.88, 0.7342857, 0.1714286),
    +      Vectors.dense(-1.9991044, 0.7247511, 0.1424392),
    +      Vectors.dense(-0.27378146, 0.31599396, -0.06204946),
    +      Vectors.dense(-0.17118812, 0.31200361, -0.02541656))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "poisson", "gamma", "tweedie")) {
    +        var trainer = new GeneralizedLinearRegression().setFamily(family)
    +          .setFitIntercept(fitIntercept).setOffsetCol("offset")
    +          .setWeightCol("weight").setLinkPredictionCol("linkPrediction")
    +        if (family == "tweedie") trainer = trainer.setVariancePower(1.5)
    +        val model = trainer.fit(dataset)
    +        val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
    +        assert(actual ~= expected(idx) absTol 1e-4, s"Model mismatch: GLM with family = $family," +
    --- End diff --
    
    We need to be checking more than just the coefficients. For example, the computation of the null deviance does not match R, since the null model computation does not consider the offsets.
    
    Actually, I think we ought to just incorporate offsets into all of the other tests, which will make sure offsets are exhaustively tested. This has been done before e.g. https://github.com/apache/spark/pull/15488, and it _is_ a real pain, but it's probably the best way. I'd be open to other arguments though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72162 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72162/testReport)** for PR 16699 at commit [`59e10f7`](https://github.com/apache/spark/commit/59e10f77f2676a3755536a46d171ee723d226ee5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #71974 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71974/testReport)** for PR 16699 at commit [`a1f5695`](https://github.com/apache/spark/commit/a1f56952b4ffcdfb400ff0ce014987c50de5f33e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72672/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #78727 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78727/testReport)** for PR 16699 at commit [`1e47a11`](https://github.com/apache/spark/commit/1e47a11f6a3e1292426ea25e7728b805c6650e1b).
     * This patch **fails PySpark pip packaging tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Not sure what this error msg means, but it seems unrelated to this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @actuaryzhang Sorry for late response. I will review this PR after finishing 2.2 release, thanks for your patience.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98019881
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -578,6 +578,79 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    +  test("generalized linear regression with offset") {
    +    /*
    +      R code:
    +      df <- as.data.frame(matrix(c(
    +        1.0, 1.0, 2.0, 0.0, 5.0,
    +        2.0, 2.0, 0.5, 1.0, 2.0,
    +        1.0, 3.0, 1.0, 2.0, 1.0,
    +        2.0, 4.0, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- c(gaussian, poisson, Gamma)
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +
    +      [1] 0.535040431 0.005390836
    +      [1]  0.1968355 -0.2061711
    +      [1]  0.307996 -0.153579
    +      [1] -0.8800000  0.7342857  0.1714286
    +      [1] -1.9991044  0.7247511  0.1424392
    +      [1] -0.27378146  0.31599396 -0.06204946
    +     */
    +    val dataset = Seq(
    +      OffsetInstance(1.0, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(2.0, 2.0, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(1.0, 3.0, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(2.0, 4.0, 0.0, Vectors.dense(3.0, 3.0))
    +    ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0.0, 0.535040431, 0.005390836),
    +      Vectors.dense(0.0, 0.1968355, -0.2061711),
    +      Vectors.dense(0.0, 0.307996, -0.153579),
    +      Vectors.dense(-0.88, 0.7342857, 0.1714286),
    +      Vectors.dense(-1.9991044, 0.7247511, 0.1424392),
    +      Vectors.dense(-0.27378146, 0.31599396, -0.06204946))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "poisson", "gamma")) {
    +      val trainer = new GeneralizedLinearRegression().setFamily(family)
    --- End diff --
    
    it looks like the spacing is off a little here?  the val trainer = ... is at the same level as the for above.  I'm surprised the style checker didn't catch something like this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98490824
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -336,14 +361,19 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
         }
     
         val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    -    val instances: RDD[Instance] =
    -      dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
    -        case Row(label: Double, weight: Double, features: Vector) =>
    -          Instance(label, weight, features)
    -      }
    +    val off = if (!isDefined(offsetCol) || $(offsetCol).isEmpty) {
    --- End diff --
    
    maybe rename off to offset?  also it looks like the spacing below needs to be fixed (extra indentation).
    +    val off = if (!isDefined(offsetCol) || $(offsetCol).isEmpty) { 
    +      lit(0.0) 
    +    } else { 
    +      col($(offsetCol)).cast(DoubleType) 
    +    } 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @sethah Is there anything else you would recommend for this PR? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100934554
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -798,77 +798,160 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    -  test("glm summary: gaussian family with weight") {
    +  test("generalized linear regression with offset") {
         /*
    -       R code:
    +      R code:
    +      library(statmod)
     
    -       A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    -       b <- c(17, 19, 23, 29)
    -       w <- c(1, 2, 3, 4)
    -       df <- as.data.frame(cbind(A, b))
    -     */
    -    val datasetWithWeight = Seq(
    -      Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
    -      Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)),
    -      Instance(23.0, 3.0, Vectors.dense(2.0, 11.0)),
    -      Instance(29.0, 4.0, Vectors.dense(3.0, 13.0))
    +      df <- as.data.frame(matrix(c(
    +        0.2, 1.0, 2.0, 0.0, 5.0,
    +        0.5, 2.1, 0.5, 1.0, 2.0,
    +        0.9, 0.4, 1.0, 2.0, 1.0,
    +        0.7, 0.7, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- list(gaussian, binomial, poisson, Gamma, tweedie(1.5))
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +      [1]  0.5169222 -0.3344444
    +      [1]  0.9419107 -0.6864404
    +      [1]  0.1812436 -0.6568422
    +      [1] -0.2869094  0.7857710
    +      [1] 0.1055254 0.2979113
    +      [1] -0.05990345  0.53188982 -0.32118415
    +      [1] -0.2147117  0.9911750 -0.6356096
    +      [1] -1.5616130  0.6646470 -0.3192581
    +      [1]  0.3390397 -0.3406099  0.6870259
    +      [1] 0.3665034 0.1039416 0.1484616
    +    */
    +    val dataset = Seq(
    +      OffsetInstance(0.2, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(0.5, 2.1, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(0.9, 0.4, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(0.7, 0.7, 0.0, Vectors.dense(3.0, 3.0))
         ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0, 0.5169222, -0.3344444),
    +      Vectors.dense(0, 0.9419107, -0.6864404),
    +      Vectors.dense(0, 0.1812436, -0.6568422),
    +      Vectors.dense(0, -0.2869094, 0.785771),
    +      Vectors.dense(0, 0.1055254, 0.2979113),
    +      Vectors.dense(-0.05990345, 0.53188982, -0.32118415),
    +      Vectors.dense(-0.2147117, 0.991175, -0.6356096),
    +      Vectors.dense(-1.561613, 0.664647, -0.3192581),
    +      Vectors.dense(0.3390397, -0.3406099, 0.6870259),
    +      Vectors.dense(0.3665034, 0.1039416, 0.1484616))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "binomial", "poisson", "gamma", "tweedie")) {
    --- End diff --
    
    +1 this is a great idea


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98019350
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -578,6 +578,79 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    +  test("generalized linear regression with offset") {
    +    /*
    +      R code:
    +      df <- as.data.frame(matrix(c(
    +        1.0, 1.0, 2.0, 0.0, 5.0,
    +        2.0, 2.0, 0.5, 1.0, 2.0,
    +        1.0, 3.0, 1.0, 2.0, 1.0,
    +        2.0, 4.0, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- c(gaussian, poisson, Gamma)
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +
    +      [1] 0.535040431 0.005390836
    +      [1]  0.1968355 -0.2061711
    +      [1]  0.307996 -0.153579
    +      [1] -0.8800000  0.7342857  0.1714286
    +      [1] -1.9991044  0.7247511  0.1424392
    +      [1] -0.27378146  0.31599396 -0.06204946
    +     */
    +    val dataset = Seq(
    +      OffsetInstance(1.0, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(2.0, 2.0, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(1.0, 3.0, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(2.0, 4.0, 0.0, Vectors.dense(3.0, 3.0))
    +    ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0.0, 0.535040431, 0.005390836),
    +      Vectors.dense(0.0, 0.1968355, -0.2061711),
    +      Vectors.dense(0.0, 0.307996, -0.153579),
    +      Vectors.dense(-0.88, 0.7342857, 0.1714286),
    +      Vectors.dense(-1.9991044, 0.7247511, 0.1424392),
    +      Vectors.dense(-0.27378146, 0.31599396, -0.06204946))
    +
    +    import GeneralizedLinearRegression._
    --- End diff --
    
    should this be moved to the imports above?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @yanboliang Thanks for the update. 
    This is a big change, although lots of then are on the test side. 
    Let me know if there is anything I can do to help make the review easier. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #71994 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71994/testReport)** for PR 16699 at commit [`d2afcb0`](https://github.com/apache/spark/commit/d2afcb0c335c133f1a86c8587fd7459ac39935a0).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r97870619
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala ---
    @@ -20,6 +20,7 @@ package org.apache.spark.ml.optim
     import org.apache.spark.internal.Logging
     import org.apache.spark.ml.feature.Instance
     import org.apache.spark.ml.linalg._
    +import org.apache.spark.ml.regression.GLRInstance
    --- End diff --
    
    this is a bit strange - you are using the GLRInstance that you added to specifically GeneralizedLinearRegression.scala in the more generic optimizer IterativelyReweightedLeastSquares.scala.  It doesn't seem right for this file to depend on anything in the regression directory, it should really be the other way around.
    I'm wondering if either:
    1.) We can move the GLRInstance to a more generic place
    2.) OR add it to Instance.scala as a separate case class eg "OffsetInstance"
    3.) OR keep the offset a separate value



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98015797
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -736,15 +765,27 @@ class GeneralizedLinearRegressionModel private[ml] (
       private lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj)
     
       override protected def predict(features: Vector): Double = {
    -    val eta = predictLink(features)
    +    if (!isSet(offsetCol) || $(offsetCol).isEmpty) {
    +      val eta = BLAS.dot(features, coefficients) + intercept
    +      familyAndLink.fitted(eta)
    +    } else {
    +      throw new SparkException("Must supply offset to predict when offset column is set.")
    +    }
    +  }
    +
    +  /**
    +   * Calculates the predicted value when offset is set.
    +   */
    +  protected def predict(features: Vector, offset: Double): Double = {
    +    val eta = predictLink(features, offset)
         familyAndLink.fitted(eta)
       }
     
       /**
    -   * Calculate the link prediction (linear predictor) of the given instance.
    +   * Calculates the link prediction (linear predictor) of the given instance.
        */
    -  private def predictLink(features: Vector): Double = {
    -    BLAS.dot(features, coefficients) + intercept
    +  private def predictLink(features: Vector, offset: Double): Double = {
    +    BLAS.dot(features, coefficients) + intercept + offset
    --- End diff --
    
    it looks like in the case where family = Gaussian and link = Identity we shouldn't be adding offset?  But then, why would anyone set an offset column... maybe, to make sure this strange case does not happen, we can throw an exception in the validation logic if user accidentally set an offset column in the case that family = Gaussian and link = Identity?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124261785
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -339,15 +364,16 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
           "GeneralizedLinearRegression was given data with 0 features, and with Param fitIntercept " +
             "set to false. To fit a model with 0 features, fitIntercept must be set to true." )
     
    -    val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    -    val instances: RDD[Instance] =
    -      dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
    -        case Row(label: Double, weight: Double, features: Vector) =>
    -          Instance(label, weight, features)
    -      }
    +    val w = if (!isSetWeightCol(this)) lit(1.0) else col($(weightCol))
    +    val offset = if (!isSetOffsetCol(this)) lit(0.0) else col($(offsetCol)).cast(DoubleType)
     
         val model = if (familyAndLink.family == Gaussian && familyAndLink.link == Identity) {
           // TODO: Make standardizeFeatures and standardizeLabel configurable.
    +      val instances: RDD[Instance] =
    +        dataset.select(col($(labelCol)), w, offset, col($(featuresCol))).rdd.map {
    +          case Row(label: Double, weight: Double, offset: Double, features: Vector) =>
    +            Instance(label - offset, weight, features)
    +        }
           val optimizer = new WeightedLeastSquares($(fitIntercept), $(regParam), elasticNetParam = 0.0,
             standardizeFeatures = true, standardizeLabel = true)
           val wlsModel = optimizer.fit(instances)
    --- End diff --
    
    What do you think of adding a new interface ```fit(instances: RDD[OffsetInstance])``` for ```WeightedLeastSquares```? Then we can remove some redundant code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    can you please:
    1.) add [MLLIB] to the title
    2.) fix the MiMa tests
    The MiMa tests error seems to be related to the interface change, probably you need to add an exclude to the file "project/MimaExcludes.scala":
    
    [error]  * abstract method getOffsetCol()java.lang.String in trait org.apache.spark.ml.param.shared.HasOffsetCol is inherited by class GeneralizedLinearRegressionBase in current version.
    [error]    filter with: ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOffsetCol.getOffsetCol")
    [error]  * abstract method offsetCol()org.apache.spark.ml.param.Param in trait org.apache.spark.ml.param.shared.HasOffsetCol is inherited by class GeneralizedLinearRegressionBase in current version.
    [error]    filter with: ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOffsetCol.offsetCol")
    [error]  * abstract synthetic method org$apache$spark$ml$param$shared$HasOffsetCol$_setter_$offsetCol_=(org.apache.spark.ml.param.Param)Unit in trait org.apache.spark.ml.param.shared.HasOffsetCol is inherited by class GeneralizedLinearRegressionBase in current version.
    [error]    filter with: ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72040/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #71975 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71975/testReport)** for PR 16699 at commit [`d071b95`](https://github.com/apache/spark/commit/d071b95ccc94404d37cde7cc122cf8a13fd04449).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #71974 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71974/testReport)** for PR 16699 at commit [`a1f5695`](https://github.com/apache/spark/commit/a1f56952b4ffcdfb400ff0ce014987c50de5f33e).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124733379
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala ---
    @@ -27,3 +27,29 @@ import org.apache.spark.ml.linalg.Vector
      * @param features The vector of features for this data point.
      */
     private[ml] case class Instance(label: Double, weight: Double, features: Vector)
    +
    +/**
    + * Case class that represents an instance of data point with
    + * label, weight, offset and features.
    + * This is mainly used in GeneralizedLinearRegression currently.
    + *
    + * @param label Label for this data point.
    + * @param weight The weight of this instance.
    + * @param offset The offset used for this data point.
    + * @param features The vector of features for this data point.
    + */
    +private[ml] case class OffsetInstance(
    +    label: Double,
    +    weight: Double,
    +    offset: Double,
    +    features: Vector) {
    +
    +  /** Constructs from an [[Instance]] object and offset */
    +  def this(instance: Instance, offset: Double = 0.0) = {
    --- End diff --
    
    Remove it if it was not used anymore.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72863/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100922910
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala ---
    @@ -27,3 +27,25 @@ import org.apache.spark.ml.linalg.Vector
      * @param features The vector of features for this data point.
      */
     private[ml] case class Instance(label: Double, weight: Double, features: Vector)
    +
    +/**
    + * Case class that represents an instance of data point with
    + * label, weight, offset and features.
    + *
    + * @param label Label for this data point.
    + * @param weight The weight of this instance.
    + * @param offset The offset used for this data point.
    + * @param features The vector of features for this data point.
    + */
    +private[ml] case class OffsetInstance(label: Double, weight: Double, offset: Double,
    --- End diff --
    
    I don't have a strong preference, but maybe calling it `GLMInstance` is clearer?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124399685
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -339,15 +364,16 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
           "GeneralizedLinearRegression was given data with 0 features, and with Param fitIntercept " +
             "set to false. To fit a model with 0 features, fitIntercept must be set to true." )
     
    -    val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    -    val instances: RDD[Instance] =
    -      dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
    -        case Row(label: Double, weight: Double, features: Vector) =>
    -          Instance(label, weight, features)
    -      }
    +    val w = if (!isSetWeightCol(this)) lit(1.0) else col($(weightCol))
    +    val offset = if (!isSetOffsetCol(this)) lit(0.0) else col($(offsetCol)).cast(DoubleType)
     
         val model = if (familyAndLink.family == Gaussian && familyAndLink.link == Identity) {
           // TODO: Make standardizeFeatures and standardizeLabel configurable.
    +      val instances: RDD[Instance] =
    +        dataset.select(col($(labelCol)), w, offset, col($(featuresCol))).rdd.map {
    +          case Row(label: Double, weight: Double, offset: Double, features: Vector) =>
    +            Instance(label - offset, weight, features)
    +        }
           val optimizer = new WeightedLeastSquares($(fitIntercept), $(regParam), elasticNetParam = 0.0,
             standardizeFeatures = true, standardizeLabel = true)
           val wlsModel = optimizer.fit(instances)
    --- End diff --
    
    I would suggest we leave `WeightedLeastSquares` as is, since it is a general purpose optimization tool and offset is more specific to GLM. I have not seen a weighted least squares implementation that supports offset. 
    
    We discussed something relevant above [here](https://github.com/apache/spark/pull/16699/files/d44974cfe50092bb639a31aa7aa9b16eb1d21fae#diff-6759d92c079f0957bfa814e339e10e7eR301
    ). I originally defined `val instances: RDD[OffsetInstance]` outside the `ifelse` and then convert it to `RDD[Instance]` for the Gaussian identity link case. But this will incur one extra map. There was some concern that this could be expensive. However, if this extra conversion is not a big deal, I can revert it back to that which is basically the implementation of the `OffsetInstance` interface for `WeightedLeastSquares`. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72857/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98012093
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala ---
    @@ -27,3 +27,25 @@ import org.apache.spark.ml.linalg.Vector
      * @param features The vector of features for this data point.
      */
     private[ml] case class Instance(label: Double, weight: Double, features: Vector)
    +
    +/**
    + * Case class that represents an instance of data point with
    + * label, weight, offset and features.
    + *
    + * @param label Label for this data point.
    + * @param weight The weight of this instance.
    + * @param offset The offset used for this data point.
    + * @param features The vector of features for this data point.
    + */
    +private[ml] case class OffsetInstance(label: Double, weight: Double, offset: Double,
    +                                   features: Vector) {
    +
    +  /** Constructs from an [[Instance]] object and offset */
    +  def this(instance: Instance, offset: Double = 0.0) = {
    +    this(instance.label, instance.weight, offset, instance.features)
    +  }
    +
    +  /** Converts to an [[Instance]] object by leaving out the offset. */
    +  private[ml] def toInstance: Instance = Instance(label, weight, features)
    --- End diff --
    
    it looks like this method is only used once in a test case, might it be better to remove it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124869366
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -961,14 +1008,16 @@ class GeneralizedLinearRegressionModel private[ml] (
       }
     
       override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
    --- End diff --
    
    Thanks for summarizing the different cases. I think this is worth a deeper discussion as follow-up work. Let me work on this in another PR. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100974912
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -798,77 +798,160 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    -  test("glm summary: gaussian family with weight") {
    +  test("generalized linear regression with offset") {
         /*
    -       R code:
    +      R code:
    +      library(statmod)
     
    -       A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    -       b <- c(17, 19, 23, 29)
    -       w <- c(1, 2, 3, 4)
    -       df <- as.data.frame(cbind(A, b))
    -     */
    -    val datasetWithWeight = Seq(
    -      Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
    -      Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)),
    -      Instance(23.0, 3.0, Vectors.dense(2.0, 11.0)),
    -      Instance(29.0, 4.0, Vectors.dense(3.0, 13.0))
    +      df <- as.data.frame(matrix(c(
    +        0.2, 1.0, 2.0, 0.0, 5.0,
    +        0.5, 2.1, 0.5, 1.0, 2.0,
    +        0.9, 0.4, 1.0, 2.0, 1.0,
    +        0.7, 0.7, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- list(gaussian, binomial, poisson, Gamma, tweedie(1.5))
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +      [1]  0.5169222 -0.3344444
    +      [1]  0.9419107 -0.6864404
    +      [1]  0.1812436 -0.6568422
    +      [1] -0.2869094  0.7857710
    +      [1] 0.1055254 0.2979113
    +      [1] -0.05990345  0.53188982 -0.32118415
    +      [1] -0.2147117  0.9911750 -0.6356096
    +      [1] -1.5616130  0.6646470 -0.3192581
    +      [1]  0.3390397 -0.3406099  0.6870259
    +      [1] 0.3665034 0.1039416 0.1484616
    +    */
    +    val dataset = Seq(
    +      OffsetInstance(0.2, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(0.5, 2.1, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(0.9, 0.4, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(0.7, 0.7, 0.0, Vectors.dense(3.0, 3.0))
         ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0, 0.5169222, -0.3344444),
    +      Vectors.dense(0, 0.9419107, -0.6864404),
    +      Vectors.dense(0, 0.1812436, -0.6568422),
    +      Vectors.dense(0, -0.2869094, 0.785771),
    +      Vectors.dense(0, 0.1055254, 0.2979113),
    +      Vectors.dense(-0.05990345, 0.53188982, -0.32118415),
    +      Vectors.dense(-0.2147117, 0.991175, -0.6356096),
    +      Vectors.dense(-1.561613, 0.664647, -0.3192581),
    +      Vectors.dense(0.3390397, -0.3406099, 0.6870259),
    +      Vectors.dense(0.3665034, 0.1039416, 0.1484616))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "binomial", "poisson", "gamma", "tweedie")) {
    +        var trainer = new GeneralizedLinearRegression().setFamily(family)
    --- End diff --
    
    Changed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98010924
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -94,6 +94,17 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
       @Since("2.0.0")
       def getLinkPredictionCol: String = $(linkPredictionCol)
     
    +  /**
    +   * Param for offset column name. If this is not set or empty, we treat all
    +   * instance offsets as 0.0.
    +   * @group param
    +   */
    +  final val offsetCol: Param[String] = new Param[String](this, "offsetCol", "The offset " +
    +    "column name. If this is not set or empty, we treat all instance offsets as 0.0")
    +
    +  /** @group getParam */
    +  def getOffsetCol: String = $(offsetCol)
    --- End diff --
    
    thanks for fixing!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @imatiach-msft Thanks so much for your detailed review. Incredibly helpful. I've addressed all your comments in the new commit. Major changes are highlighted below:
    1. Create `OffsetInstance` in Instance.scala and remove dependency of IWLS on regression. 
    2. Add param checking for offset. 
    
    Let me know if there is anything else that needs improvement. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @actuaryzhang thank you for updating the PR - I agree with your first two points.  With regards to the method of RDD[Instance] also working with RDD[OffsetInstance], I think it should work, as long as you define the method such that it takes in the parameter myAwesomeMethod(myRdd: RDD[_ <: Instance])
    I think you need that "<:" in the method definition to allow the method to take in anything that inherits from Instance.
    
    With regards to "Would you please elaborate on the following comment..." it looks like I made a mistake, I did not notice that you were subtracting the offset from the response when creating the Instance object in the previous commit, in which case you can ignore that comment.
    
    The PR looks good to me, maybe the committers can comment on it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Got it. I should pay more attention to that mailing list from now on :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78729/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124753530
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -961,14 +1008,16 @@ class GeneralizedLinearRegressionModel private[ml] (
       }
     
       override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
    --- End diff --
    
    I summarized all four cases for making prediction as following:
    
    Estimator(training data)  | Transformer(prediction data) | How R predict | How Spark predict
    ------------------------- | ----------------------------- | --------------- | ------------------
    w/ offset column | w/ offset column | use offset of prediction data | use offset of prediction data
    w/ offset column | w/o offset column | use offset of training data | not use offset
    w/o offset column | w/ offset column | not use offset | not use offset
    w/o offset column | w/o offset column | not use offset | not use offset
    
    For case 1 and 4, there is not that controversial.
    For case 2, the reason behind a different way to handle is we can't store all ```offset``` data in our model like what R does, but we should print a warning log to let users know that is different from R.
    For case 3, in your current implementation, it ignores whether the model was trained with offset. I think it might be worth discussing. I think the correct way should consider whether the model was trained with offset. If the model was trained without offset, we should ignore the offset column when making prediction on new dataset. Or at least, we should print out warning to remind users.
    However, I think we can discuss and resolve this issue in follow-up work. @actuaryzhang What do you think my proposal of how Spark make prediction? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124737015
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -134,6 +134,25 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
       @Since("2.0.0")
       def getLinkPredictionCol: String = $(linkPredictionCol)
     
    +  /**
    +   * Param for offset column name. If this is not set or empty, we treat all instance offsets
    +   * as 0.0. The feature specified as offset has a constant coefficient of 1.0.
    +   * @group param
    +   */
    +  final val offsetCol: Param[String] = new Param[String](this, "offsetCol", "The offset " +
    --- End diff --
    
    ```@Since("2.3.0")```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    #18489 fixed the build failure. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98013552
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -263,17 +288,21 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
         }
     
         val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    -    val instances: RDD[Instance] =
    -      dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
    -        case Row(label: Double, weight: Double, features: Vector) =>
    -          Instance(label, weight, features)
    +    val off = if (!isDefined(offsetCol) || $(offsetCol).isEmpty) lit(0.0) else col($(offsetCol))
    +    val instances: RDD[OffsetInstance] =
    +      dataset.select(col($(labelCol)), w, off, col($(featuresCol))).rdd.map {
    +        case Row(label: Double, weight: Double, offset: Double, features: Vector) =>
    +          OffsetInstance(label, weight, offset, features)
           }
     
         val model = if (familyObj == Gaussian && linkObj == Identity) {
           // TODO: Make standardizeFeatures and standardizeLabel configurable.
    +      val wlsInstances: RDD[Instance] = instances.map { instance =>
    +        Instance(instance.label - instance.offset, instance.weight, instance.features)
    --- End diff --
    
    another thing you can do, if you don't want to change the hierarchy, is to move the initialization of instances inside the if/else.  Then, for weightedleastsquares, you can just create RDD[Instance], but for IRLS weighted instances.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100934645
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -168,6 +179,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
         }
     
         val newSchema = super.validateAndTransformSchema(schema, fitting, featuresDataType)
    +    if (isSetOffsetCol(this)) SchemaUtils.checkNumericType(schema, $(offsetCol))
    --- End diff --
    
    +1 good catch


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72095/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @sethah Thanks much for your review. I've made a new commit that addressed all your comments. Please see my inline comments.  Let me know if there is any other suggestions. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71995/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98569953
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -336,14 +361,19 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
         }
     
         val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    -    val instances: RDD[Instance] =
    -      dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
    -        case Row(label: Double, weight: Double, features: Vector) =>
    -          Instance(label, weight, features)
    -      }
    +    val offset = if (!isDefined(offsetCol) || $(offsetCol).isEmpty) {
    +      lit(0.0)
    +    } else {
    +      col($(offsetCol)).cast(DoubleType)
    +    }
    --- End diff --
    
    minor style comment - I think these 4 lines here need to be indented with 2 more spaces:
      lit(0.0)  
    } else {  
      col($(offsetCol)).cast(DoubleType)  
    }  



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710] Add offset in GLM

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    you should not modify sharedParams directly. 
    "// DO NOT MODIFY THIS FILE! It was generated by SharedParamsCodeGen."
    
    And if there is no other algorithms inheriting hasoffset, I suggest that do not create this


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72863/testReport)** for PR 16699 at commit [`4b336be`](https://github.com/apache/spark/commit/4b336be160fc75c4908ee7ac4a17b3eb6d04541c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124281406
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -961,14 +1007,30 @@ class GeneralizedLinearRegressionModel private[ml] (
       }
     
       override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
    -    val predictUDF = udf { (features: Vector) => predict(features) }
    -    val predictLinkUDF = udf { (features: Vector) => predictLink(features) }
    +    val predictUDF = udf { (features: Vector, offset: Double) => predict(features, offset) }
    +    val predictLinkUDF = udf { (features: Vector, offset: Double) => predictLink(features, offset) }
    +    /*
    +     Offset is only validated when it's specified in the model and available in prediction data set.
    +     When offset is specified but missing in the prediction data set, we default it to zero.
    +     */
    +    val offset = {
    +      if (!isSetOffsetCol(this)) {
    +        lit(0.0)
    +      } else {
    +        if (dataset.schema.fieldNames.contains($(offsetCol))) {
    +          SchemaUtils.checkNumericType(dataset.schema, $(offsetCol))
    +          col($(offsetCol)).cast(DoubleType)
    +        } else {
    +          lit(0.0)
    +        }
    +      }
    +    }
    --- End diff --
    
    Can we simplify it as following?
    ```
    val offset = if (!isSetOffsetCol(this)) lit(0.0) else col($(offsetCol)).cast(DoubleType)
    ```
    Here it's not necessary to run ```checkNumericType```, since it has been check in ```validateAndTransformSchema```.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72095 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72095/testReport)** for PR 16699 at commit [`52bc32b`](https://github.com/apache/spark/commit/52bc32b2d86b2cd5ce092f86ee61f8fe9aebec5d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124258933
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquaresSuite.scala ---
    @@ -43,7 +43,7 @@ class IterativelyReweightedLeastSquaresSuite extends SparkFunSuite with MLlibTes
           Instance(0.0, 2.0, Vectors.dense(1.0, 2.0)),
           Instance(1.0, 3.0, Vectors.dense(2.0, 1.0)),
           Instance(0.0, 4.0, Vectors.dense(3.0, 3.0))
    -    ), 2)
    +    ), 2).map(new OffsetInstance(_))
    --- End diff --
    
    Please construct ```OffsetInstance``` with offset = 0.0 directly, it will make code more easy to understand.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124233224
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -406,6 +437,14 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
     
       private[regression] val epsilon: Double = 1E-16
     
    +  /** Checks whether weight column is set and nonempty */
    +  private[regression] def isSetWeightCol(params: GeneralizedLinearRegressionBase): Boolean =
    --- End diff --
    
    Can we move the two functions to ```GeneralizedLinearRegressionBase``` and rename as ```hasWeightCol```, just like ```hasLinkPredictionCol```?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78727/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #78920 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78920/testReport)** for PR 16699 at commit [`db0ac93`](https://github.com/apache/spark/commit/db0ac937b9df678f7f58b26bab1afa43d77eb5a1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124272947
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -798,77 +798,184 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    -  test("glm summary: gaussian family with weight") {
    +  test("generalized linear regression with offset") {
         /*
    -       R code:
    +      R code:
    +      library(statmod)
     
    -       A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    -       b <- c(17, 19, 23, 29)
    -       w <- c(1, 2, 3, 4)
    -       df <- as.data.frame(cbind(A, b))
    -     */
    -    val datasetWithWeight = Seq(
    -      Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
    -      Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)),
    -      Instance(23.0, 3.0, Vectors.dense(2.0, 11.0)),
    -      Instance(29.0, 4.0, Vectors.dense(3.0, 13.0))
    +      df <- as.data.frame(matrix(c(
    +        0.2, 1.0, 2.0, 0.0, 5.0,
    +        0.5, 2.1, 0.5, 1.0, 2.0,
    +        0.9, 0.4, 1.0, 2.0, 1.0,
    +        0.7, 0.7, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- list(gaussian, binomial, poisson, Gamma, tweedie(1.5))
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +      [1]  0.5169222 -0.3344444
    +      [1]  0.9419107 -0.6864404
    +      [1]  0.1812436 -0.6568422
    +      [1] -0.2869094  0.7857710
    +      [1] 0.1055254 0.2979113
    +      [1] -0.05990345  0.53188982 -0.32118415
    +      [1] -0.2147117  0.9911750 -0.6356096
    +      [1] -1.5616130  0.6646470 -0.3192581
    +      [1]  0.3390397 -0.3406099  0.6870259
    +      [1] 0.3665034 0.1039416 0.1484616
    +    */
    +    val dataset = Seq(
    +      OffsetInstance(0.2, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(0.5, 2.1, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(0.9, 0.4, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(0.7, 0.7, 0.0, Vectors.dense(3.0, 3.0))
         ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0, 0.5169222, -0.3344444),
    +      Vectors.dense(0, 0.9419107, -0.6864404),
    +      Vectors.dense(0, 0.1812436, -0.6568422),
    +      Vectors.dense(0, -0.2869094, 0.785771),
    +      Vectors.dense(0, 0.1055254, 0.2979113),
    +      Vectors.dense(-0.05990345, 0.53188982, -0.32118415),
    +      Vectors.dense(-0.2147117, 0.991175, -0.6356096),
    +      Vectors.dense(-1.561613, 0.664647, -0.3192581),
    +      Vectors.dense(0.3390397, -0.3406099, 0.6870259),
    +      Vectors.dense(0.3665034, 0.1039416, 0.1484616))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "binomial", "poisson", "gamma", "tweedie")) {
    +        val trainer = new GeneralizedLinearRegression().setFamily(family)
    +          .setFitIntercept(fitIntercept).setOffsetCol("offset")
    +          .setWeightCol("weight").setLinkPredictionCol("linkPrediction")
    +        if (family == "tweedie") trainer.setVariancePower(1.5)
    +        val model = trainer.fit(dataset)
    +        val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
    +        assert(actual ~= expected(idx) absTol 1e-4, s"Model mismatch: GLM with family = $family," +
    +          s" and fitIntercept = $fitIntercept.")
    +
    +        val familyLink = FamilyAndLink(trainer)
    +        model.transform(dataset).select("features", "offset", "prediction", "linkPrediction")
    +          .collect().foreach {
    +          case Row(features: DenseVector, offset: Double, prediction1: Double,
    +          linkPrediction1: Double) =>
    +            val eta = BLAS.dot(features, model.coefficients) + model.intercept + offset
    +            val prediction2 = familyLink.fitted(eta)
    +            val linkPrediction2 = eta
    +            assert(prediction1 ~= prediction2 relTol 1E-5, "Prediction mismatch: GLM with " +
    +              s"family = $family, and fitIntercept = $fitIntercept.")
    +            assert(linkPrediction1 ~= linkPrediction2 relTol 1E-5, "Link Prediction mismatch: " +
    +              s"GLM with family = $family, and fitIntercept = $fitIntercept.")
    +        }
    +
    +        idx += 1
    +      }
    +    }
    +  }
    +
    +  test("generalized linear regression: predict with no offset") {
    +    val trainData = Seq(
    +      OffsetInstance(2.0, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(8.0, 2.0, 3.0, Vectors.dense(1.0, 7.0)),
    +      OffsetInstance(3.0, 3.0, 1.0, Vectors.dense(2.0, 11.0)),
    +      OffsetInstance(9.0, 4.0, 4.0, Vectors.dense(3.0, 13.0))
    +    ).toDF()
    +    val testData = trainData.select("weight", "features")
    +
    +    val trainer = new GeneralizedLinearRegression()
    +      .setFamily("poisson")
    +      .setWeightCol("weight")
    +      .setOffsetCol("offset")
    +      .setLinkPredictionCol("linkPrediction")
    +
    +    val model = trainer.fit(trainData)
    +    model.transform(testData).select("features", "linkPrediction")
    +      .collect().foreach {
    +      case Row(features: DenseVector, linkPrediction1: Double) =>
    +        val linkPrediction2 = BLAS.dot(features, model.coefficients) + model.intercept
    +        assert(linkPrediction1 ~= linkPrediction2 relTol 1E-5, "Link Prediction mismatch")
    +    }
    +  }
    +
    +  test("glm summary: gaussian family with weight and offset") {
         /*
    -       R code:
    +      R code:
     
    -       model <- glm(formula = "b ~ .", family="gaussian", data = df, weights = w)
    -       summary(model)
    +      A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    +      b <- c(17, 19, 23, 29)
    +      w <- c(1, 2, 3, 4)
    +      off <- c(2, 3, 1, 4)
    +      df <- as.data.frame(cbind(A, b))
    +     */
    +    val dataset = Seq(
    +      OffsetInstance(17.0, 1.0, 2.0, Vectors.dense(0.0, 5.0).toSparse),
    +      OffsetInstance(19.0, 2.0, 3.0, Vectors.dense(1.0, 7.0)),
    +      OffsetInstance(23.0, 3.0, 1.0, Vectors.dense(2.0, 11.0)),
    +      OffsetInstance(29.0, 4.0, 4.0, Vectors.dense(3.0, 13.0))
    +    ).toDF()
    +    /*
    +      R code:
     
    -       Deviance Residuals:
    -           1       2       3       4
    -       1.920  -1.358  -1.109   0.960
    +      model <- glm(formula = "b ~ .", family = "gaussian", data = df,
    +                   weights = w, offset = off)
    +      summary(model)
     
    -       Coefficients:
    -                   Estimate Std. Error t value Pr(>|t|)
    -       (Intercept)   18.080      9.608   1.882    0.311
    -       V1             6.080      5.556   1.094    0.471
    -       V2            -0.600      1.960  -0.306    0.811
    +      Deviance Residuals:
    +            1        2        3        4
    +       0.9600  -0.6788  -0.5543   0.4800
     
    -       (Dispersion parameter for gaussian family taken to be 7.68)
    +      Coefficients:
    +                  Estimate Std. Error t value Pr(>|t|)
    +      (Intercept)   5.5400     4.8040   1.153    0.455
    +      V1           -0.9600     2.7782  -0.346    0.788
    +      V2            1.7000     0.9798   1.735    0.333
     
    -           Null deviance: 202.00  on 3  degrees of freedom
    -       Residual deviance:   7.68  on 1  degrees of freedom
    -       AIC: 18.783
    +      (Dispersion parameter for gaussian family taken to be 1.92)
     
    -       Number of Fisher Scoring iterations: 2
    +          Null deviance: 152.10  on 3  degrees of freedom
    +      Residual deviance:   1.92  on 1  degrees of freedom
    +      AIC: 13.238
     
    -       residuals(model, type="pearson")
    -              1         2         3         4
    -       1.920000 -1.357645 -1.108513  0.960000
    +      Number of Fisher Scoring iterations: 2
     
    -       residuals(model, type="working")
    +      residuals(model, type = "pearson")
    +               1          2          3          4
    +      0.9600000 -0.6788225 -0.5542563  0.4800000
    +      residuals(model, type = "working")
               1     2     3     4
    -       1.92 -0.96 -0.64  0.48
    -
    -       residuals(model, type="response")
    +      0.96 -0.48 -0.32  0.24
    +      residuals(model, type = "response")
               1     2     3     4
    -       1.92 -0.96 -0.64  0.48
    +      0.96 -0.48 -0.32  0.24
    --- End diff --
    
    Ditto, alignment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #71994 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71994/testReport)** for PR 16699 at commit [`d2afcb0`](https://github.com/apache/spark/commit/d2afcb0c335c133f1a86c8587fd7459ac39935a0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72003/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124235464
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -303,6 +317,17 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
       def setWeightCol(value: String): this.type = set(weightCol, value)
     
       /**
    +   * Sets the value of param [[offsetCol]].
    +   * The feature specified as offset has a constant coefficient of 1.0.
    --- End diff --
    
    Move this line to param doc. Usually we keep the most integrated doc in param annotation, and for set method, we can just say ```Sets the value of param [[offsetCol]]```.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124272854
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -798,77 +798,184 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    -  test("glm summary: gaussian family with weight") {
    +  test("generalized linear regression with offset") {
         /*
    -       R code:
    +      R code:
    +      library(statmod)
     
    -       A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    -       b <- c(17, 19, 23, 29)
    -       w <- c(1, 2, 3, 4)
    -       df <- as.data.frame(cbind(A, b))
    -     */
    -    val datasetWithWeight = Seq(
    -      Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
    -      Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)),
    -      Instance(23.0, 3.0, Vectors.dense(2.0, 11.0)),
    -      Instance(29.0, 4.0, Vectors.dense(3.0, 13.0))
    +      df <- as.data.frame(matrix(c(
    +        0.2, 1.0, 2.0, 0.0, 5.0,
    +        0.5, 2.1, 0.5, 1.0, 2.0,
    +        0.9, 0.4, 1.0, 2.0, 1.0,
    +        0.7, 0.7, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- list(gaussian, binomial, poisson, Gamma, tweedie(1.5))
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +      [1]  0.5169222 -0.3344444
    +      [1]  0.9419107 -0.6864404
    +      [1]  0.1812436 -0.6568422
    +      [1] -0.2869094  0.7857710
    +      [1] 0.1055254 0.2979113
    +      [1] -0.05990345  0.53188982 -0.32118415
    +      [1] -0.2147117  0.9911750 -0.6356096
    +      [1] -1.5616130  0.6646470 -0.3192581
    +      [1]  0.3390397 -0.3406099  0.6870259
    +      [1] 0.3665034 0.1039416 0.1484616
    +    */
    +    val dataset = Seq(
    +      OffsetInstance(0.2, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(0.5, 2.1, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(0.9, 0.4, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(0.7, 0.7, 0.0, Vectors.dense(3.0, 3.0))
         ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0, 0.5169222, -0.3344444),
    +      Vectors.dense(0, 0.9419107, -0.6864404),
    +      Vectors.dense(0, 0.1812436, -0.6568422),
    +      Vectors.dense(0, -0.2869094, 0.785771),
    +      Vectors.dense(0, 0.1055254, 0.2979113),
    +      Vectors.dense(-0.05990345, 0.53188982, -0.32118415),
    +      Vectors.dense(-0.2147117, 0.991175, -0.6356096),
    +      Vectors.dense(-1.561613, 0.664647, -0.3192581),
    +      Vectors.dense(0.3390397, -0.3406099, 0.6870259),
    +      Vectors.dense(0.3665034, 0.1039416, 0.1484616))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "binomial", "poisson", "gamma", "tweedie")) {
    +        val trainer = new GeneralizedLinearRegression().setFamily(family)
    +          .setFitIntercept(fitIntercept).setOffsetCol("offset")
    +          .setWeightCol("weight").setLinkPredictionCol("linkPrediction")
    +        if (family == "tweedie") trainer.setVariancePower(1.5)
    +        val model = trainer.fit(dataset)
    +        val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
    +        assert(actual ~= expected(idx) absTol 1e-4, s"Model mismatch: GLM with family = $family," +
    +          s" and fitIntercept = $fitIntercept.")
    +
    +        val familyLink = FamilyAndLink(trainer)
    +        model.transform(dataset).select("features", "offset", "prediction", "linkPrediction")
    +          .collect().foreach {
    +          case Row(features: DenseVector, offset: Double, prediction1: Double,
    +          linkPrediction1: Double) =>
    +            val eta = BLAS.dot(features, model.coefficients) + model.intercept + offset
    +            val prediction2 = familyLink.fitted(eta)
    +            val linkPrediction2 = eta
    +            assert(prediction1 ~= prediction2 relTol 1E-5, "Prediction mismatch: GLM with " +
    +              s"family = $family, and fitIntercept = $fitIntercept.")
    +            assert(linkPrediction1 ~= linkPrediction2 relTol 1E-5, "Link Prediction mismatch: " +
    +              s"GLM with family = $family, and fitIntercept = $fitIntercept.")
    +        }
    +
    +        idx += 1
    +      }
    +    }
    +  }
    +
    +  test("generalized linear regression: predict with no offset") {
    +    val trainData = Seq(
    +      OffsetInstance(2.0, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(8.0, 2.0, 3.0, Vectors.dense(1.0, 7.0)),
    +      OffsetInstance(3.0, 3.0, 1.0, Vectors.dense(2.0, 11.0)),
    +      OffsetInstance(9.0, 4.0, 4.0, Vectors.dense(3.0, 13.0))
    +    ).toDF()
    +    val testData = trainData.select("weight", "features")
    +
    +    val trainer = new GeneralizedLinearRegression()
    +      .setFamily("poisson")
    +      .setWeightCol("weight")
    +      .setOffsetCol("offset")
    +      .setLinkPredictionCol("linkPrediction")
    +
    +    val model = trainer.fit(trainData)
    +    model.transform(testData).select("features", "linkPrediction")
    +      .collect().foreach {
    +      case Row(features: DenseVector, linkPrediction1: Double) =>
    +        val linkPrediction2 = BLAS.dot(features, model.coefficients) + model.intercept
    +        assert(linkPrediction1 ~= linkPrediction2 relTol 1E-5, "Link Prediction mismatch")
    +    }
    +  }
    +
    +  test("glm summary: gaussian family with weight and offset") {
         /*
    -       R code:
    +      R code:
     
    -       model <- glm(formula = "b ~ .", family="gaussian", data = df, weights = w)
    -       summary(model)
    +      A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    +      b <- c(17, 19, 23, 29)
    +      w <- c(1, 2, 3, 4)
    +      off <- c(2, 3, 1, 4)
    +      df <- as.data.frame(cbind(A, b))
    +     */
    +    val dataset = Seq(
    +      OffsetInstance(17.0, 1.0, 2.0, Vectors.dense(0.0, 5.0).toSparse),
    +      OffsetInstance(19.0, 2.0, 3.0, Vectors.dense(1.0, 7.0)),
    +      OffsetInstance(23.0, 3.0, 1.0, Vectors.dense(2.0, 11.0)),
    +      OffsetInstance(29.0, 4.0, 4.0, Vectors.dense(3.0, 13.0))
    +    ).toDF()
    +    /*
    +      R code:
     
    -       Deviance Residuals:
    -           1       2       3       4
    -       1.920  -1.358  -1.109   0.960
    +      model <- glm(formula = "b ~ .", family = "gaussian", data = df,
    +                   weights = w, offset = off)
    +      summary(model)
     
    -       Coefficients:
    -                   Estimate Std. Error t value Pr(>|t|)
    -       (Intercept)   18.080      9.608   1.882    0.311
    -       V1             6.080      5.556   1.094    0.471
    -       V2            -0.600      1.960  -0.306    0.811
    +      Deviance Residuals:
    +            1        2        3        4
    +       0.9600  -0.6788  -0.5543   0.4800
     
    -       (Dispersion parameter for gaussian family taken to be 7.68)
    +      Coefficients:
    +                  Estimate Std. Error t value Pr(>|t|)
    +      (Intercept)   5.5400     4.8040   1.153    0.455
    +      V1           -0.9600     2.7782  -0.346    0.788
    +      V2            1.7000     0.9798   1.735    0.333
     
    -           Null deviance: 202.00  on 3  degrees of freedom
    -       Residual deviance:   7.68  on 1  degrees of freedom
    -       AIC: 18.783
    +      (Dispersion parameter for gaussian family taken to be 1.92)
     
    -       Number of Fisher Scoring iterations: 2
    +          Null deviance: 152.10  on 3  degrees of freedom
    +      Residual deviance:   1.92  on 1  degrees of freedom
    +      AIC: 13.238
     
    -       residuals(model, type="pearson")
    -              1         2         3         4
    -       1.920000 -1.357645 -1.108513  0.960000
    +      Number of Fisher Scoring iterations: 2
     
    -       residuals(model, type="working")
    +      residuals(model, type = "pearson")
    +               1          2          3          4
    +      0.9600000 -0.6788225 -0.5542563  0.4800000
    +      residuals(model, type = "working")
               1     2     3     4
    -       1.92 -0.96 -0.64  0.48
    -
    -       residuals(model, type="response")
    +      0.96 -0.48 -0.32  0.24
    --- End diff --
    
    Minor: Note alignment between this and above line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98018075
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -753,14 +794,15 @@ class GeneralizedLinearRegressionModel private[ml] (
       }
     
       override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
    -    val predictUDF = udf { (features: Vector) => predict(features) }
    -    val predictLinkUDF = udf { (features: Vector) => predictLink(features) }
    +    val predictUDF = udf { (features: Vector, offset: Double) => predict(features, offset) }
    +    val predictLinkUDF = udf { (features: Vector, offset: Double) => predictLink(features, offset) }
    +    val off = if (!isSet(offsetCol) || $(offsetCol).isEmpty) lit(0.0) else col($(offsetCol))
    --- End diff --
    
    also, as I mentioned above, if family = Gaussian and link = Identity we shouldn't be passing any offset, but then the user shouldn't be setting an offset column probably - so either adding that to the validation logic or to the code here or to the predict function you call below would fix this unusual case; I think this seems like a validation issue and probably should be added to the validation method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r99406022
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -743,6 +743,84 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    +  test("generalized linear regression with offset") {
    +    /*
    +      R code:
    +      library(statmod)
    +      df <- as.data.frame(matrix(c(
    +        1.0, 1.0, 2.0, 0.0, 5.0,
    +        2.0, 2.0, 0.5, 1.0, 2.0,
    +        1.0, 3.0, 1.0, 2.0, 1.0,
    +        2.0, 4.0, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- list(gaussian, poisson, Gamma, tweedie(1.5))
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +
    +      [1] 0.535040431 0.005390836
    +      [1]  0.1968355 -0.2061711
    +      [1]  0.307996 -0.153579
    +      [1]  0.32166185 -0.09698986
    +      [1] -0.8800000  0.7342857  0.1714286
    +      [1] -1.9991044  0.7247511  0.1424392
    +      [1] -0.27378146  0.31599396 -0.06204946
    +      [1] -0.17118812  0.31200361 -0.02541656
    +    */
    +    val dataset = Seq(
    +      OffsetInstance(1.0, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(2.0, 2.0, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(1.0, 3.0, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(2.0, 4.0, 0.0, Vectors.dense(3.0, 3.0))
    +    ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0.0, 0.535040431, 0.005390836),
    +      Vectors.dense(0.0, 0.1968355, -0.2061711),
    +      Vectors.dense(0.0, 0.307996, -0.153579),
    +      Vectors.dense(0.0, 0.32166185, -0.09698986),
    +      Vectors.dense(-0.88, 0.7342857, 0.1714286),
    +      Vectors.dense(-1.9991044, 0.7247511, 0.1424392),
    +      Vectors.dense(-0.27378146, 0.31599396, -0.06204946),
    +      Vectors.dense(-0.17118812, 0.31200361, -0.02541656))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "poisson", "gamma", "tweedie")) {
    +        var trainer = new GeneralizedLinearRegression().setFamily(family)
    +          .setFitIntercept(fitIntercept).setOffsetCol("offset")
    +          .setWeightCol("weight").setLinkPredictionCol("linkPrediction")
    +        if (family == "tweedie") trainer = trainer.setVariancePower(1.5)
    +        val model = trainer.fit(dataset)
    +        val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
    +        assert(actual ~= expected(idx) absTol 1e-4, s"Model mismatch: GLM with family = $family," +
    --- End diff --
    
    Thanks for pointing out this. Yes, I also encountered the issue with intercept only + offset model. Will look into this and fix. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    LGTM, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    jenkins, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #78727 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78727/testReport)** for PR 16699 at commit [`1e47a11`](https://github.com/apache/spark/commit/1e47a11f6a3e1292426ea25e7728b805c6650e1b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @imatiach-msft Thanks much for your review. Renamed `off`. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72003 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72003/testReport)** for PR 16699 at commit [`d44974c`](https://github.com/apache/spark/commit/d44974cfe50092bb639a31aa7aa9b16eb1d21fae).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98014650
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -736,15 +765,27 @@ class GeneralizedLinearRegressionModel private[ml] (
       private lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj)
     
       override protected def predict(features: Vector): Double = {
    -    val eta = predictLink(features)
    --- End diff --
    
    it looks like if the family = Gaussian and link = Identity you don't even need to check for offsetCol here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78920/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124237318
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -944,15 +983,22 @@ class GeneralizedLinearRegressionModel private[ml] (
       private lazy val familyAndLink = FamilyAndLink(this)
     
       override protected def predict(features: Vector): Double = {
    -    val eta = predictLink(features)
    +    predict(features, 0.0)
    +  }
    +
    +  /**
    +   * Calculates the predicted value when offset is set.
    +   */
    +  protected def predict(features: Vector, offset: Double): Double = {
    --- End diff --
    
    Why make it ```protected```?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72863/testReport)** for PR 16699 at commit [`4b336be`](https://github.com/apache/spark/commit/4b336be160fc75c4908ee7ac4a17b3eb6d04541c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    This looks really good.  I made a couple more comments.  I'm not sure about the class hierarchy of Instance/OffsetInstance.  Maybe committers can take a look/comment on this PR? @jkbradley @srowen @yanboliang Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #78729 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78729/testReport)** for PR 16699 at commit [`1e47a11`](https://github.com/apache/spark/commit/1e47a11f6a3e1292426ea25e7728b805c6650e1b).
     * This patch **fails PySpark pip packaging tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72857 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72857/testReport)** for PR 16699 at commit [`90d68a6`](https://github.com/apache/spark/commit/90d68a67815aceae63eaad7345477a082bb2febd).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @sethah The predict method can work with new data in R. See below. Shall we focus on the current implementation, instead of discussing the details of the R behavior? :) 
    Let me know if there is any suggestion on the last commit. Thanks. 
    
    ```
    df <- data.frame(y = rnorm(10), x = rnorm(10), off = rnorm(10))
    model <- glm(y ~ x, data = df, offset = off)
    df2 <- data.frame(y = rnorm(20), x = rnorm(20), off = rnorm(20))
    p1 <- predict(model, df2, type = "response")
    p2 <- as.numeric(model.matrix(y~x, df2) %*% coef(model) + df2$off)
    sum(p1 - p2)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r97877367
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -94,6 +94,17 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
       @Since("2.0.0")
       def getLinkPredictionCol: String = $(linkPredictionCol)
     
    +  /**
    +   * Param for offset column name. If this is not set or empty, we treat all
    +   * instance offsets as 0.0.
    +   * @group param
    +   */
    +  final val offsetCol: Param[String] = new Param[String](this, "offsetCol", "The offset " +
    +    "column name. If this is not set or empty, we treat all instance offsets as 0.0")
    +
    +  /** @group getParam */
    +  def getOffsetCol: String = $(offsetCol)
    --- End diff --
    
    it looks like you will need to update the validateAndTransformSchema method below to validate these parameters - eg check if the column exists? (similar to what the base class does for features/label columns)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100975164
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -168,6 +179,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
         }
     
         val newSchema = super.validateAndTransformSchema(schema, fitting, featuresDataType)
    +    if (isSetOffsetCol(this)) SchemaUtils.checkNumericType(schema, $(offsetCol))
    --- End diff --
    
    @sethah This is a great point. The new commit does allow offset to be missing when making predictions. I now check validity of offset only when it's set and available in the prediction set. Otherwise, set offset to be zero. Thanks for catching this and contributing the test. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98169691
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -336,14 +361,15 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
         }
     
         val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    -    val instances: RDD[Instance] =
    -      dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
    -        case Row(label: Double, weight: Double, features: Vector) =>
    -          Instance(label, weight, features)
    -      }
    +    val off = if (!isDefined(offsetCol) || $(offsetCol).isEmpty) lit(0.0) else col($(offsetCol))
    --- End diff --
    
    cast `offsetCol` to DoubleType here if you want to support non-double numeric datatype. 
    `weightCol` and `labelCol` are automaticly cast into DoubleType in `Predictor.scala`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    I'm finding R's behavior for prediction with offsets to be a bit strange. Yes, R does use the original offsets supplied during training to do prediction, but what if I want to make predictions on a completely new dataset, with completely different offsets? R still tries to use the original ones, which doesn't make sense to me. Then, maybe predicting on "new" data when you've trained using offsets doesn't make sense? Regardless, here we are allowing users to pass in a new dataframe to make predictions on, where they can supply an offset column for the new "test" data - offsets that are not the same as the training offsets. This is a feature not provided by R, it seems.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @zhengruifeng @imatiach-msft 
    Thanks much for pointing out the issue due to the hasOffset trait. This is what caused the test to fail. I have moved it to the GLRBase class. Things are working now \U0001f44d 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98011972
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -263,17 +288,21 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
         }
     
         val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    -    val instances: RDD[Instance] =
    -      dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
    -        case Row(label: Double, weight: Double, features: Vector) =>
    -          Instance(label, weight, features)
    +    val off = if (!isDefined(offsetCol) || $(offsetCol).isEmpty) lit(0.0) else col($(offsetCol))
    +    val instances: RDD[OffsetInstance] =
    +      dataset.select(col($(labelCol)), w, off, col($(featuresCol))).rdd.map {
    +        case Row(label: Double, weight: Double, offset: Double, features: Vector) =>
    +          OffsetInstance(label, weight, offset, features)
           }
     
         val model = if (familyObj == Gaussian && linkObj == Identity) {
           // TODO: Make standardizeFeatures and standardizeLabel configurable.
    +      val wlsInstances: RDD[Instance] = instances.map { instance =>
    +        Instance(instance.label - instance.offset, instance.weight, instance.features)
    --- End diff --
    
    going over all OffsetInstance and converting again to Instance seems like it would be expensive and it would increase memory usage.  What do you think about the alternative of making OffsetInstance inherit from Instance - but then you would have to change Instance from a case class - and then just passing Instance to the fit below?  Is there anything else we could do here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72040 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72040/testReport)** for PR 16699 at commit [`e183c08`](https://github.com/apache/spark/commit/e183c08373733249d7b9c476875c2a4a2dff3c05).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124234628
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -168,6 +179,9 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
         }
     
         val newSchema = super.validateAndTransformSchema(schema, fitting, featuresDataType)
    +    if (fitting) {
    +      if (isSetOffsetCol(this)) SchemaUtils.checkNumericType(schema, $(offsetCol))
    +    }
    --- End diff --
    
    We need check numeric type for both fit & transform. ```Offset``` column was used when model prediction if applicable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @zhengruifeng Thanks for the suggestions. Added casting and instrumentation. 
    @imatiach-msft Thanks for the clarification! It is probably worth  another PR to clean up all tests in GLM.  
    Let me know if there is any additional comments! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98017514
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -753,14 +794,15 @@ class GeneralizedLinearRegressionModel private[ml] (
       }
     
       override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
    -    val predictUDF = udf { (features: Vector) => predict(features) }
    -    val predictLinkUDF = udf { (features: Vector) => predictLink(features) }
    +    val predictUDF = udf { (features: Vector, offset: Double) => predict(features, offset) }
    +    val predictLinkUDF = udf { (features: Vector, offset: Double) => predictLink(features, offset) }
    +    val off = if (!isSet(offsetCol) || $(offsetCol).isEmpty) lit(0.0) else col($(offsetCol))
    --- End diff --
    
    calling the offset column "off" was a little confusing to me, maybe we can give it a better name, like "offsetCol" or even "offset".  When I looked at the code below I first thought it was some sort of flag passed to the predict function.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the issue:

https://github.com/apache/spark/pull/16699

@actuaryzhang This is looking pretty good overall. Regarding the prediction logic, R glm does not allow you to predict with offsets, correct? I notice that statsmodels in Python _does_ allow it. For, e.g. Poisson regression, predicting with offsets is a _slight_ convenience, I guess. The point of training with offsets in this case is to effectively model _rates_ instead of _counts_, and so when you predict you'd be interested in the rates and predicting without offsets, that's exactly what you get. If you want to apply those rates to a specific exposure time, then you just take your predictions and multiply them with `exposure * y_hat` where y_hat is the predicted rate. I think I have that correct. My question then, is do we need to support prediction with offset? It makes things clunkier in the code, so there is some reason not to. Since you presumably have more experience with using GLMs, I'd like to hear your thoughts.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r97872960
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -358,13 +384,13 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
          * The reweight function used to update offsets and weights
          * at each iteration of [[IterativelyReweightedLeastSquares]].
          */
    -    val reweightFunc: (Instance, WeightedLeastSquaresModel) => (Double, Double) = {
    -      (instance: Instance, model: WeightedLeastSquaresModel) => {
    -        val eta = model.predict(instance.features)
    +    val reweightFunc: (GLRInstance, WeightedLeastSquaresModel) => (Double, Double) = {
    +      (instance: GLRInstance, model: WeightedLeastSquaresModel) => {
    +        val eta = model.predict(instance.features) + instance.offset
    --- End diff --
    
    minor suggestion - instead of doing:
    eta = prediction + offset
    mu = fitted(eta)
    newLabel = eta - offset + stuff
    maybe do:
    eta = prediction
    mu = fitted(eta + offset)
    newLabel = eta + stuff



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72068 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72068/testReport)** for PR 16699 at commit [`da4174a`](https://github.com/apache/spark/commit/da4174a6c2639001828e587794e1a32dcf8db15d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72672 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72672/testReport)** for PR 16699 at commit [`fc64d32`](https://github.com/apache/spark/commit/fc64d32e72e1e406d6f7ab27563ea622aa7b1397).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124265572
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquaresSuite.scala ---
    @@ -169,29 +169,29 @@ class IterativelyReweightedLeastSquaresSuite extends SparkFunSuite with MLlibTes
     object IterativelyReweightedLeastSquaresSuite {
     
       def BinomialReweightFunc(
    -      instance: Instance,
    +      instance: OffsetInstance,
           model: WeightedLeastSquaresModel): (Double, Double) = {
    -    val eta = model.predict(instance.features)
    +    val eta = model.predict(instance.features) + instance.offset
         val mu = 1.0 / (1.0 + math.exp(-1.0 * eta))
    -    val z = eta + (instance.label - mu) / (mu * (1.0 - mu))
    +    val z = eta - instance.offset + (instance.label - mu) / (mu * (1.0 - mu))
    --- End diff --
    
    Why ```- instance.offset```? I suspect it's wrong, the test doesn't fail because it's zero.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100574976
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -798,77 +798,160 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    -  test("glm summary: gaussian family with weight") {
    +  test("generalized linear regression with offset") {
         /*
    -       R code:
    +      R code:
    +      library(statmod)
     
    -       A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    -       b <- c(17, 19, 23, 29)
    -       w <- c(1, 2, 3, 4)
    -       df <- as.data.frame(cbind(A, b))
    -     */
    -    val datasetWithWeight = Seq(
    -      Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
    -      Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)),
    -      Instance(23.0, 3.0, Vectors.dense(2.0, 11.0)),
    -      Instance(29.0, 4.0, Vectors.dense(3.0, 13.0))
    +      df <- as.data.frame(matrix(c(
    +        0.2, 1.0, 2.0, 0.0, 5.0,
    +        0.5, 2.1, 0.5, 1.0, 2.0,
    +        0.9, 0.4, 1.0, 2.0, 1.0,
    +        0.7, 0.7, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- list(gaussian, binomial, poisson, Gamma, tweedie(1.5))
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +      [1]  0.5169222 -0.3344444
    +      [1]  0.9419107 -0.6864404
    +      [1]  0.1812436 -0.6568422
    +      [1] -0.2869094  0.7857710
    +      [1] 0.1055254 0.2979113
    +      [1] -0.05990345  0.53188982 -0.32118415
    +      [1] -0.2147117  0.9911750 -0.6356096
    +      [1] -1.5616130  0.6646470 -0.3192581
    +      [1]  0.3390397 -0.3406099  0.6870259
    +      [1] 0.3665034 0.1039416 0.1484616
    +    */
    +    val dataset = Seq(
    +      OffsetInstance(0.2, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(0.5, 2.1, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(0.9, 0.4, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(0.7, 0.7, 0.0, Vectors.dense(3.0, 3.0))
         ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0, 0.5169222, -0.3344444),
    +      Vectors.dense(0, 0.9419107, -0.6864404),
    +      Vectors.dense(0, 0.1812436, -0.6568422),
    +      Vectors.dense(0, -0.2869094, 0.785771),
    +      Vectors.dense(0, 0.1055254, 0.2979113),
    +      Vectors.dense(-0.05990345, 0.53188982, -0.32118415),
    +      Vectors.dense(-0.2147117, 0.991175, -0.6356096),
    +      Vectors.dense(-1.561613, 0.664647, -0.3192581),
    +      Vectors.dense(0.3390397, -0.3406099, 0.6870259),
    +      Vectors.dense(0.3665034, 0.1039416, 0.1484616))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "binomial", "poisson", "gamma", "tweedie")) {
    --- End diff --
    
    What do you think about changing this to:
    
    `for (family <- GeneralizedLinearRegression.supportedFamilyNames)` ?
    
    That way we are being sure that we include all the families. I realize that this would break the ordering and therefore require changing up some of the code. I'd be quite alright doing that in a different PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @yanboliang @sethah 
    Any suggestion on moving this PR forward? Appreciate your comments and reviews. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #78920 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78920/testReport)** for PR 16699 at commit [`db0ac93`](https://github.com/apache/spark/commit/db0ac937b9df678f7f58b26bab1afa43d77eb5a1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100919210
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -406,6 +435,14 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
     
       private[regression] val epsilon: Double = 1E-16
     
    +  /** Checks whether weight column is set and nonempty */
    +  private[regression] def isSetWeightCol(params: GeneralizedLinearRegressionBase): Boolean =
    +    params.isSet(params.weightCol) && !params.getWeightCol.isEmpty
    +
    +  /** Checks whether offset column is set and nonempty */
    +  private[regression] def isSetOffsetCol(params: GeneralizedLinearRegressionBase): Boolean =
    +    params.isSet(params.offsetCol) && !params.getOffsetCol.isEmpty
    --- End diff --
    
    These should be using `isDefined` instead of `isSet` I believe. Also, I think using `params.getOffsetCol.nonEmpty` is slightly clearer.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16699

@imatiach-msft Many thanks again for the review. I have incorporated some of your suggestions:
1. Create initialization of instance directly if it is Gaussian(identity) to avoid expensive conversion from OffsetInstance.
2. Fix spacing issues in test
3. Change `off` to `offset` to be more informative.

Feedback on the other suggestions:
1. `move import to beginning, use ~==`: I did not change this for consistency with other tests in GLM. It's probably better to update all tests in a separate PR if this is preferred.
2. Inheriting Instance: I'm not sure about this and leaning toward not touching `Instance` since it is used a lot elsewhere. Let's see what others suggest.
Question: if `OffsetInstance` is a subclass of `Instance`, will a method with a signature `RDD[Instance]` also work with `RDD[OffsetInstance]`?
3. Would you please elaborate on the following comment:
> if family = Gaussian and link = Identity we shouldn't be passing any offset.

This is the linear regression case and it still allows `offset`. The offset will be subtracted from the response as can be seen from the initialization in WLS `Instance(label - offset, weight, features)`.

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #72162 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72162/testReport)** for PR 16699 at commit [`59e10f7`](https://github.com/apache/spark/commit/59e10f77f2676a3755536a46d171ee723d226ee5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #71975 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71975/testReport)** for PR 16699 at commit [`d071b95`](https://github.com/apache/spark/commit/d071b95ccc94404d37cde7cc122cf8a13fd04449).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged into master. Thanks for contribution and all reviews! This great feature will benefit lots of users.
    @actuaryzhang Could you send follow-up PRs to address the two inline comments? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100917713
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -1139,54 +1189,52 @@ class GeneralizedLinearRegressionSummary private[regression] (
     
       /** Degrees of freedom. */
       @Since("2.0.0")
    -  lazy val degreesOfFreedom: Long = {
    -    numInstances - rank
    -  }
    +  lazy val degreesOfFreedom: Long = numInstances - rank
     
       /** The residual degrees of freedom. */
       @Since("2.0.0")
       lazy val residualDegreeOfFreedom: Long = degreesOfFreedom
     
       /** The residual degrees of freedom for the null model. */
       @Since("2.0.0")
    -  lazy val residualDegreeOfFreedomNull: Long = if (model.getFitIntercept) {
    -    numInstances - 1
    -  } else {
    -    numInstances
    +  lazy val residualDegreeOfFreedomNull: Long = {
    +    if (model.getFitIntercept) numInstances - 1 else numInstances
       }
     
    -  private def weightCol: Column = {
    -    if (!model.isDefined(model.weightCol) || model.getWeightCol.isEmpty) {
    -      lit(1.0)
    -    } else {
    -      col(model.getWeightCol)
    -    }
    +  private def label: Column = col(model.getLabelCol).cast(DoubleType)
    +
    +  private def prediction: Column = col(predictionCol)
    +
    +  private def weight: Column = {
    +    if (!isSetWeightCol(model)) lit(1.0) else col(model.getWeightCol)
    --- End diff --
    
    This is computed differently than before. Is there a reason why? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124262306
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquaresSuite.scala ---
    @@ -156,7 +156,7 @@ class IterativelyReweightedLeastSquaresSuite extends SparkFunSuite with MLlibTes
         var idx = 0
         for (fitIntercept <- Seq(false, true)) {
           val initial = new WeightedLeastSquares(fitIntercept, regParam = 0.0, elasticNetParam = 0.0,
    -        standardizeFeatures = false, standardizeLabel = false).fit(instances2)
    +        standardizeFeatures = false, standardizeLabel = false).fit(instances2.map(_.toInstance))
    --- End diff --
    
    See my above comment about adding interface ```fit(instances: RDD[OffsetInstance])```.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #78812 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78812/testReport)** for PR 16699 at commit [`1e47a11`](https://github.com/apache/spark/commit/1e47a11f6a3e1292426ea25e7728b805c6650e1b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    **[Test build #78729 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78729/testReport)** for PR 16699 at commit [`1e47a11`](https://github.com/apache/spark/commit/1e47a11f6a3e1292426ea25e7728b805c6650e1b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124733535
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala ---
    @@ -27,3 +27,29 @@ import org.apache.spark.ml.linalg.Vector
      * @param features The vector of features for this data point.
      */
     private[ml] case class Instance(label: Double, weight: Double, features: Vector)
    +
    +/**
    + * Case class that represents an instance of data point with
    + * label, weight, offset and features.
    + * This is mainly used in GeneralizedLinearRegression currently.
    + *
    + * @param label Label for this data point.
    + * @param weight The weight of this instance.
    + * @param offset The offset used for this data point.
    + * @param features The vector of features for this data point.
    + */
    +private[ml] case class OffsetInstance(
    +    label: Double,
    +    weight: Double,
    +    offset: Double,
    +    features: Vector) {
    +
    +  /** Constructs from an [[Instance]] object and offset */
    +  def this(instance: Instance, offset: Double = 0.0) = {
    +    this(instance.label, instance.weight, offset, instance.features)
    +  }
    +
    +  /** Converts to an [[Instance]] object by leaving out the offset. */
    +  private[ml] def toInstance: Instance = Instance(label, weight, features)
    --- End diff --
    
    Remove ```private[ml]``` since you have marked the whole class as ```private[ml]```.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    @yanboliang Thanks much for the review. The new commit includes everything you suggested except implementing `WeightLeastSquares` interface for `OffsetInstance`. Please see my incline comments above. 
    
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r98012237
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala ---
    @@ -27,3 +27,25 @@ import org.apache.spark.ml.linalg.Vector
      * @param features The vector of features for this data point.
      */
     private[ml] case class Instance(label: Double, weight: Double, features: Vector)
    +
    +/**
    + * Case class that represents an instance of data point with
    + * label, weight, offset and features.
    + *
    + * @param label Label for this data point.
    + * @param weight The weight of this instance.
    + * @param offset The offset used for this data point.
    + * @param features The vector of features for this data point.
    + */
    +private[ml] case class OffsetInstance(label: Double, weight: Double, offset: Double,
    +                                   features: Vector) {
    +
    +  /** Constructs from an [[Instance]] object and offset */
    +  def this(instance: Instance, offset: Double = 0.0) = {
    +    this(instance.label, instance.weight, offset, instance.features)
    +  }
    +
    +  /** Converts to an [[Instance]] object by leaving out the offset. */
    +  private[ml] def toInstance: Instance = Instance(label, weight, features)
    --- End diff --
    
    actually, another alternative might be to make OffsetInstance inherit from Instance (which I wrote below as well).  What do you think about this idea?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78812/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    jenkins, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124402141
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquaresSuite.scala ---
    @@ -169,29 +169,29 @@ class IterativelyReweightedLeastSquaresSuite extends SparkFunSuite with MLlibTes
     object IterativelyReweightedLeastSquaresSuite {
     
       def BinomialReweightFunc(
    -      instance: Instance,
    +      instance: OffsetInstance,
           model: WeightedLeastSquaresModel): (Double, Double) = {
    -    val eta = model.predict(instance.features)
    +    val eta = model.predict(instance.features) + instance.offset
         val mu = 1.0 / (1.0 + math.exp(-1.0 * eta))
    -    val z = eta + (instance.label - mu) / (mu * (1.0 - mu))
    +    val z = eta - instance.offset + (instance.label - mu) / (mu * (1.0 - mu))
    --- End diff --
    
    Indeed this is the correct implementation: in the IRWLS, we only include offset when computing `mu` and use `Xb` (without offset) when updating the working label. To see this clearly, one would have to derive the IRWLS. But for a quick reference, below is R's implementation:
    
    ```
    eta <- drop(x %*% start)
    mu <- linkinv(eta <- eta + offset)
    z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
    w <- sqrt((weights[good] * mu.eta.val[good]^2)/variance(mu)[good])
    fit <- .Call(C_Cdqrls, x[good, , drop = FALSE] * 
                  w, z * w, min(1e-07, control$epsilon/1000), check = FALSE)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Ah, thank you very much for that clarification, I don't have much experience using R. I tried this out earlier, and it seems you have to use the same offset column name as you show above. I apologize for not making my intentions clear earlier - I wanted to discuss the R behavior to make sure we were doing the right thing here. 
    
    We actually discuss R's behavior a lot and even use it as a baseline for our unit tests, as you know. I think being able to compare new Spark PRs to other packages enables reviewers who may not be experts in a particular area to still contribute. More importantly, since it takes quite a lot of time and effort to do code review I like to make sure we have decided on the high-level approach before spending time on implementation details, which I find saves time in the long run. JMO, hope it makes sense :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100975556
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -406,6 +435,14 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
     
       private[regression] val epsilon: Double = 1E-16
     
    +  /** Checks whether weight column is set and nonempty */
    +  private[regression] def isSetWeightCol(params: GeneralizedLinearRegressionBase): Boolean =
    +    params.isSet(params.weightCol) && !params.getWeightCol.isEmpty
    +
    +  /** Checks whether offset column is set and nonempty */
    +  private[regression] def isSetOffsetCol(params: GeneralizedLinearRegressionBase): Boolean =
    +    params.isSet(params.offsetCol) && !params.getOffsetCol.isEmpty
    --- End diff --
    
    I adopted `params.getOffsetCol.nonEmpty`. 
    As for `isDefined` VS `isSet`, in my other PR #16344, @yanboliang suggested `isSet` is more accurate and we should be using `isSet` and may need to change the existing `isDefined` at some point. For this reason, I'm using `isSet` (which I believe checks whether the param is explicitly set by the user).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100975590
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -944,15 +981,27 @@ class GeneralizedLinearRegressionModel private[ml] (
       private lazy val familyAndLink = FamilyAndLink(this)
     
       override protected def predict(features: Vector): Double = {
    -    val eta = predictLink(features)
    +    if (!isSetOffsetCol(this)) {
    --- End diff --
    
    Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100976709
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -1139,54 +1189,52 @@ class GeneralizedLinearRegressionSummary private[regression] (
     
       /** Degrees of freedom. */
       @Since("2.0.0")
    -  lazy val degreesOfFreedom: Long = {
    -    numInstances - rank
    -  }
    +  lazy val degreesOfFreedom: Long = numInstances - rank
     
       /** The residual degrees of freedom. */
       @Since("2.0.0")
       lazy val residualDegreeOfFreedom: Long = degreesOfFreedom
     
       /** The residual degrees of freedom for the null model. */
       @Since("2.0.0")
    -  lazy val residualDegreeOfFreedomNull: Long = if (model.getFitIntercept) {
    -    numInstances - 1
    -  } else {
    -    numInstances
    +  lazy val residualDegreeOfFreedomNull: Long = {
    +    if (model.getFitIntercept) numInstances - 1 else numInstances
       }
     
    -  private def weightCol: Column = {
    -    if (!model.isDefined(model.weightCol) || model.getWeightCol.isEmpty) {
    -      lit(1.0)
    -    } else {
    -      col(model.getWeightCol)
    -    }
    +  private def label: Column = col(model.getLabelCol).cast(DoubleType)
    +
    +  private def prediction: Column = col(predictionCol)
    +
    +  private def weight: Column = {
    +    if (!isSetWeightCol(model)) lit(1.0) else col(model.getWeightCol)
    --- End diff --
    
    The computation is the same, but I just changed the name. The original name was kind of confusing: the name `xxxCol` is used to refer the string name in almost all other places, but here it was used as `Column`.  Since I need to add the offset column, I just updated the name  so that they are consistent. Also, the method `isSetWeightCol` is a simple wrapper of the old logic that checks both setting and nonempty of weightCol.  It is also used in a few other places, hence the wrapper. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    whew, this was a lot of work :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r100624754
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -798,77 +798,160 @@ class GeneralizedLinearRegressionSuite
         }
       }
     
    -  test("glm summary: gaussian family with weight") {
    +  test("generalized linear regression with offset") {
         /*
    -       R code:
    +      R code:
    +      library(statmod)
     
    -       A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
    -       b <- c(17, 19, 23, 29)
    -       w <- c(1, 2, 3, 4)
    -       df <- as.data.frame(cbind(A, b))
    -     */
    -    val datasetWithWeight = Seq(
    -      Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
    -      Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)),
    -      Instance(23.0, 3.0, Vectors.dense(2.0, 11.0)),
    -      Instance(29.0, 4.0, Vectors.dense(3.0, 13.0))
    +      df <- as.data.frame(matrix(c(
    +        0.2, 1.0, 2.0, 0.0, 5.0,
    +        0.5, 2.1, 0.5, 1.0, 2.0,
    +        0.9, 0.4, 1.0, 2.0, 1.0,
    +        0.7, 0.7, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- list(gaussian, binomial, poisson, Gamma, tweedie(1.5))
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +      [1]  0.5169222 -0.3344444
    +      [1]  0.9419107 -0.6864404
    +      [1]  0.1812436 -0.6568422
    +      [1] -0.2869094  0.7857710
    +      [1] 0.1055254 0.2979113
    +      [1] -0.05990345  0.53188982 -0.32118415
    +      [1] -0.2147117  0.9911750 -0.6356096
    +      [1] -1.5616130  0.6646470 -0.3192581
    +      [1]  0.3390397 -0.3406099  0.6870259
    +      [1] 0.3665034 0.1039416 0.1484616
    +    */
    +    val dataset = Seq(
    +      OffsetInstance(0.2, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(0.5, 2.1, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(0.9, 0.4, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(0.7, 0.7, 0.0, Vectors.dense(3.0, 3.0))
         ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0, 0.5169222, -0.3344444),
    +      Vectors.dense(0, 0.9419107, -0.6864404),
    +      Vectors.dense(0, 0.1812436, -0.6568422),
    +      Vectors.dense(0, -0.2869094, 0.785771),
    +      Vectors.dense(0, 0.1055254, 0.2979113),
    +      Vectors.dense(-0.05990345, 0.53188982, -0.32118415),
    +      Vectors.dense(-0.2147117, 0.991175, -0.6356096),
    +      Vectors.dense(-1.561613, 0.664647, -0.3192581),
    +      Vectors.dense(0.3390397, -0.3406099, 0.6870259),
    +      Vectors.dense(0.3665034, 0.1039416, 0.1484616))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "binomial", "poisson", "gamma", "tweedie")) {
    +        var trainer = new GeneralizedLinearRegression().setFamily(family)
    --- End diff --
    
    this does not need to be a var


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16699
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72162/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org