You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by hhbyyh <gi...@git.apache.org> on 2016/03/07 00:30:10 UTC

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

GitHub user hhbyyh opened a pull request:

    https://github.com/apache/spark/pull/11549

    [SPARK-12566] [ML] [WIP] GLM model family, link function support in SparkR:::glm

    ## What changes were proposed in this pull request?
    
    This JIRA is for extending the support of MLlib's Generalized Linear Models (GLMs) to more model families and link functions in SparkR. After SPARK-12811, we should be able to wrap GeneralizedLinearRegression in SparkR with support of popular families and link functions.
    
    
    ## How was this patch tested?
    
    WIP, some manual test
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hhbyyh/spark glmR

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11549.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11549
    
----
commit 6c933650389798f8e3caf3e50604bceae79a126e
Author: Yuhao Yang <hh...@gmail.com>
Date:   2016-03-06T23:00:44Z

    change R glm to use GLM

commit 1cca19e68d9ef256769594e02d123ce6e3b0bd7d
Author: Yuhao Yang <hh...@gmail.com>
Date:   2016-03-06T23:27:58Z

    refine family

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by hhbyyh <gi...@git.apache.org>.

Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-193015522
  
    Since we already have a `glm` in SparkR which is based on `LogisticRegressionModel` and `LinearRegressionModel`. There're three ways to extend it as I understand:
    
    1. Change the current glm to use `GeneralizedLinearRegression`. Create another `lm` interface for sparkR, and use LR as the model. 
    2. Keep glm R interface. and replace its implementation with GLM. This means R can not invoke LR anymore.
    2. Keep glm R interface, and combine the implementation with both LR and GLM based on different solver parameter.
    I'd prefer to use option 1. And I'm gonna send one PR(WIP) for solution 2, which can later be adjusted to 1 or 3.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-193020984
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52534/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11549#discussion_r56290850
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala ---
    @@ -17,15 +17,41 @@
     
     package org.apache.spark.ml.api.r
     
    +import org.apache.spark.SparkException
     import org.apache.spark.ml.{Pipeline, PipelineModel}
     import org.apache.spark.ml.attribute._
     import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
     import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
     import org.apache.spark.ml.feature.{RFormula, VectorAssembler}
    -import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
    +import org.apache.spark.ml.regression._
     import org.apache.spark.sql.DataFrame
     
     private[r] object SparkRWrappers {
    +  def fitGLM(
    +      value: String,
    +      df: DataFrame,
    +      family: String,
    +      lambda: Double,
    +      solver: String): PipelineModel = {
    +    if (solver.trim != "irls") throw new SparkException("Currently only support irls")
    +
    +    val formula = new RFormula().setFormula(value)
    +    val regex = "^\\s*(\\w+)\\s*(\\(\\s*link\\s*=\\s*\"(\\w+)\"\\s*\\))?\\s*$".r
    --- End diff --
    
    Do not use regex here. Extract the names on R side:
    
    ~~~R
    > b <- binomial(link = "logit")
    > b$family
    [1] "binomial"
    > b$link
    [1] "logit"
    ~~~



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11549#discussion_r55597240
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -569,9 +572,46 @@ class GeneralizedLinearRegressionModel private[ml] (
         familyAndLink.fitted(eta)
       }
     
    +  private var trainingSummary: Option[GeneralizedLinearRegressionSummary] = None
    +
    +  private[regression] def setSummary(summary: GeneralizedLinearRegressionSummary): this.type = {
    +    this.trainingSummary = Some(summary)
    +    this
    +  }
    +
    +  /**
    +    * Gets summary of model on training set. An exception is
    +    * thrown if `trainingSummary == None`.
    +    */
    +  @Since("2.0.0")
    +  def summary: GeneralizedLinearRegressionSummary = trainingSummary match {
    +    case Some(summ) => summ
    +    case None =>
    +      throw new SparkException(
    +        "No training summary available for this GeneralizedLinearRegressionModel",
    +        new NullPointerException())
    +  }
    +
       @Since("2.0.0")
       override def copy(extra: ParamMap): GeneralizedLinearRegressionModel = {
         copyValues(new GeneralizedLinearRegressionModel(uid, coefficients, intercept), extra)
           .setParent(parent)
       }
     }
    +
    +/**
    + * :: Experimental ::
    + * GeneralizedLinearRegressionModel results evaluated on a dataset.
    + *
    + * @param predictions dataframe outputted by the model's `transform` method.
    + * @param predictionCol field in "predictions" which gives the prediction of each instance.
    + * @param labelCol field in "predictions" which gives the true label of each instance.
    + * @param featuresCol field in "predictions" which gives the features of each instance as a vector.
    + */
    +@Experimental
    +@Since("2.0.0")
    +class GeneralizedLinearRegressionSummary private[regression] (
    +    @Since("2.0.0") @transient val predictions: DataFrame,
    +    @Since("2.0.0") val predictionCol: String,
    --- End diff --
    
    Based on a private discussion with @jkbradley , we should not expose the name of the columns in the public API. Users should be referring to the original model to get access to the column. In `LinearRegressionSummary`, they are passed around because they are required for metrics. We do not need them here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-197791218
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53410/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11549#discussion_r56290601
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -51,13 +45,12 @@ setClass("PipelineModel", representation(model = "jobj"))
     #' summary(model)
     #'}
     setMethod("glm", signature(formula = "formula", family = "ANY", data = "DataFrame"),
    -          function(formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0,
    -            standardize = TRUE, solver = "auto") {
    +          function(formula, family = c("gaussian", "binomial", "poisson", "gamma"), data,
    --- End diff --
    
    We should match R's signature for `family` now. Internally, we can support family/link functions R supports. If user input `binomial("logit")`, we can get the family name and the link name before we call the Scala implementation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-199541034
  
    Looks like there are breaking signature changes - should we document that?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-197791215
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-208216908
  
    @hhbyyh I sent #12294 , please feel free to comment. Thanks! Would you mind to close this PR? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11549#discussion_r55563547
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala ---
    @@ -17,15 +17,41 @@
     
     package org.apache.spark.ml.api.r
     
    +import org.apache.spark.SparkException
     import org.apache.spark.ml.{Pipeline, PipelineModel}
     import org.apache.spark.ml.attribute._
     import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
     import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
     import org.apache.spark.ml.feature.{RFormula, VectorAssembler}
    -import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
    +import org.apache.spark.ml.regression._
     import org.apache.spark.sql.DataFrame
     
     private[r] object SparkRWrappers {
    +  def fitGLM(
    +      value: String,
    +      df: DataFrame,
    +      family: String,
    +      lambda: Double,
    +      solver: String): PipelineModel = {
    +    if (solver.trim != "irls") throw new SparkException("Currently only support irls")
    +
    +    val formula = new RFormula().setFormula(value)
    +    val regex = "^\\s*(\\w+)\\s*(\\(\\s*link\\s*=\\s*\"(\\w+)\"\\s*\\))?\\s*$".r
    --- End diff --
    
    in order to minimize the escaping, you can use Scala's raw strings:
    ```scala
    """^\s*(\w+...\s*$""".r
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11549#discussion_r55595546
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -569,9 +572,46 @@ class GeneralizedLinearRegressionModel private[ml] (
         familyAndLink.fitted(eta)
       }
     
    +  private var trainingSummary: Option[GeneralizedLinearRegressionSummary] = None
    +
    +  private[regression] def setSummary(summary: GeneralizedLinearRegressionSummary): this.type = {
    +    this.trainingSummary = Some(summary)
    +    this
    +  }
    +
    +  /**
    +    * Gets summary of model on training set. An exception is
    +    * thrown if `trainingSummary == None`.
    +    */
    +  @Since("2.0.0")
    +  def summary: GeneralizedLinearRegressionSummary = trainingSummary match {
    --- End diff --
    
    nit: you can also do 
    ```scala
    trainingSummary.getOrElse {
      throw new Exception(...)
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-194528019
  
    @hhbyyh thanks! I just have some small comments; my main comment being in the jira ticket regarding the choice of options 1/2/3.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11549#discussion_r56290662
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -29,15 +29,9 @@ setClass("PipelineModel", representation(model = "jobj"))
     #' @param formula A symbolic description of the model to be fitted. Currently only a few formula
     #'                operators are supported, including '~', '.', ':', '+', and '-'.
     #' @param data DataFrame for training
    -#' @param family Error distribution. "gaussian" -> linear regression, "binomial" -> logistic reg.
    +#' @param family a description of the error distribution and link function to be used in the model..
    --- End diff --
    
    link to R's `family`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by hhbyyh <gi...@git.apache.org>.

Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-205629238
  
    Thanks @jkbradley. We cannot decide which options to go. I think @yanboliang and @thunterdb both would like to go with option 3, yet there're more details to be decided about the mapping between family, link, solver and glm/lm implementations.
    
    Actually I think it may be more efficient if one of the committers can take lead on this. This is more like a strategic decision rather than code wrapper. I would not mind close this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/11549#discussion_r56907102

--- Diff: R/pkg/R/mllib.R ---
@@ -29,15 +29,10 @@ setClass("PipelineModel", representation(model = "jobj"))
#' @param formula A symbolic description of the model to be fitted. Currently only a few formula
#' operators are supported, including '~', '.', ':', '+', and '-'.
#' @param data DataFrame for training
-#' @param family Error distribution. "gaussian" -> linear regression, "binomial" -> logistic reg.
+#' @param family a description of the error distribution and link function to be used in the model,
+#' as in [[https://stat.ethz.ch/R-manual/R-devel/library/stats/html/family.html]]
#' @param lambda Regularization parameter
-#' @param alpha Elastic-net mixing parameter (see glmnet's documentation for details)
-#' @param standardize Whether to standardize features before training
-#' @param solver The solver algorithm used for optimization, this can be "l-bfgs", "normal" and
-#' "auto". "l-bfgs" denotes Limited-memory BFGS which is a limited-memory
-#' quasi-Newton optimization method. "normal" denotes using Normal Equation as an
-#' analytical solution to the linear regression problem. The default value is "auto"
-#' which means that the solver algorithm is selected automatically.
+#' @param solver Currently only support "irls" which is also the default solver.
--- End diff --

The previous comment was more explicit, especially with respect to 'auto' (the default). It should mention auto and irls as the two options.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-193020961
  
    **[Test build #52534 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52534/consoleFull)** for PR 11549 at commit [`1cca19e`](https://github.com/apache/spark/commit/1cca19e68d9ef256769594e02d123ce6e3b0bd7d).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-197905850
  
    @hhbyyh I vote option 3 in JIRA. We already have ```GeneralizedLinearRegression``` in Scala, so it's better to call this implementation from SparkR directly. Due to ```GeneralizedLinearRegression``` can not handle dataset with more than 4096 features, so it should call ```LinearRegression``` and ```LogisticRegression``` with "l-bfgs" solver currently. Actually for ```gaussian``` family, "normal" solver equals to "irls" solver, we can unify them as "irls" and deprecated "normal" solver. 
    I think We should support ```GeneralizedLinearRegression```'s solver can be switched between ```auto, irls, l-bfgs``` at Scala side in a separate PR. After that the SparkR::glm can directly call the Scala implementation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11549#discussion_r56908061
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala ---
    @@ -17,15 +17,34 @@
     
     package org.apache.spark.ml.api.r
     
    +import org.apache.spark.SparkException
     import org.apache.spark.ml.{Pipeline, PipelineModel}
     import org.apache.spark.ml.attribute._
     import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
     import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
     import org.apache.spark.ml.feature.{RFormula, VectorAssembler}
    -import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
    +import org.apache.spark.ml.regression._
     import org.apache.spark.sql.DataFrame
     
     private[r] object SparkRWrappers {
    +  def fitGLM(
    +      value: String,
    +      df: DataFrame,
    +      family: String,
    +      link: String,
    +      lambda: Double,
    +      solver: String): PipelineModel = {
    +    val formula = new RFormula().setFormula(value)
    +    val estimator = new GeneralizedLinearRegression()
    +      .setFamily(family)
    +      .setRegParam(lambda)
    +      .setFitIntercept(formula.hasIntercept)
    +
    +    if (link != null) estimator.setLink(link)
    +    val pipeline = new Pipeline().setStages(Array(formula, estimator))
    +    pipeline.fit(df)
    +  }
    +
       def fitRModelFormula(
    --- End diff --
    
    that method is not used anymore, right? We should remove it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11549#discussion_r55592623
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala ---
    @@ -17,15 +17,41 @@
     
     package org.apache.spark.ml.api.r
     
    +import org.apache.spark.SparkException
     import org.apache.spark.ml.{Pipeline, PipelineModel}
     import org.apache.spark.ml.attribute._
     import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
     import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
     import org.apache.spark.ml.feature.{RFormula, VectorAssembler}
    -import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
    +import org.apache.spark.ml.regression._
     import org.apache.spark.sql.DataFrame
     
     private[r] object SparkRWrappers {
    +  def fitGLM(
    +      value: String,
    +      df: DataFrame,
    +      family: String,
    +      lambda: Double,
    +      solver: String): PipelineModel = {
    +    if (solver.trim != "irls") throw new SparkException("Currently only support irls")
    +
    +    val formula = new RFormula().setFormula(value)
    +    val regex = "^\\s*(\\w+)\\s*(\\(\\s*link\\s*=\\s*\"(\\w+)\"\\s*\\))?\\s*$".r
    +    val estimator = family match {
    +      case regex(familyName, group2, linkName) =>
    --- End diff --
    
    I am confused: why do you need a regex here? I do not see anything special on the other side in R.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-206625152
  
    @hhbyyh @yanboliang I just wrote out my thoughts in the JIRA, and I think they match what @yanboliang suggested above (for option 3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-199516534
  
    @hhbyyh thanks for the split. I have two small comments. Also, can you include some tests in `test_mllib.R`? It should be very close to the manual testing you did and to the tests already in that file.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by hhbyyh <gi...@git.apache.org>.

Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-197773071
  
    Thanks @mengxr @thunterdb @yanboliang for the review. Sent an update:
    1. resolve the conflict with GLMSummary.
    2. revert the summary statistics related part.
    3. extract family and link name in R


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11549#discussion_r56290664
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -51,13 +45,12 @@ setClass("PipelineModel", representation(model = "jobj"))
     #' summary(model)
     #'}
     setMethod("glm", signature(formula = "formula", family = "ANY", data = "DataFrame"),
    -          function(formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0,
    -            standardize = TRUE, solver = "auto") {
    +          function(formula, family = c("gaussian", "binomial", "poisson", "gamma"), data,
    +              lambda = 0, solver = "irls") {
    --- End diff --
    
    +1 on "auto"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-193015610
  
    **[Test build #52534 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52534/consoleFull)** for PR 11549 at commit [`1cca19e`](https://github.com/apache/spark/commit/1cca19e68d9ef256769594e02d123ce6e3b0bd7d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by hhbyyh <gi...@git.apache.org>.

Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-206641960
  
    @jkbradley Thanks for the suggestion.
    @yanboliang Please start work on this if you are interested, since it's your idea that matches the best solution. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-193020982
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11549#discussion_r56289063
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala ---
    @@ -17,15 +17,41 @@
     
     package org.apache.spark.ml.api.r
     
    +import org.apache.spark.SparkException
     import org.apache.spark.ml.{Pipeline, PipelineModel}
     import org.apache.spark.ml.attribute._
     import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
     import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
     import org.apache.spark.ml.feature.{RFormula, VectorAssembler}
    -import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
    +import org.apache.spark.ml.regression._
     import org.apache.spark.sql.DataFrame
     
     private[r] object SparkRWrappers {
    +  def fitGLM(
    +      value: String,
    +      df: DataFrame,
    +      family: String,
    +      lambda: Double,
    +      solver: String): PipelineModel = {
    +    if (solver.trim != "irls") throw new SparkException("Currently only support irls")
    +
    +    val formula = new RFormula().setFormula(value)
    +    val regex = "^\\s*(\\w+)\\s*(\\(\\s*link\\s*=\\s*\"(\\w+)\"\\s*\\))?\\s*$".r
    +    val estimator = family match {
    +      case regex(familyName, group2, linkName) =>
    --- End diff --
    
    +1. The regex is unnecessary at here, ```RFormula``` can parse formula and handle illegal formula. 
    You may noticed that I use regex in #11447, that is because we only support a subset  of the formula in ```survreg``` currently.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-205844978
  
    @jkbradley @hhbyyh I can work on this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11549#discussion_r55591400
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -51,13 +45,12 @@ setClass("PipelineModel", representation(model = "jobj"))
     #' summary(model)
     #'}
     setMethod("glm", signature(formula = "formula", family = "ANY", data = "DataFrame"),
    -          function(formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0,
    -            standardize = TRUE, solver = "auto") {
    +          function(formula, family = c("gaussian", "binomial", "poisson", "gamma"), data,
    +              lambda = 0, solver = "irls") {
    --- End diff --
    
    I would keep the solver to 'auto', so that we can change the implementation of the solver without regression. However, the option 'irls' is available for users who want to use it.
    
    As for the other options for the solver, see my comment in the ticket.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-197192755
  
    @yanboliang @hhbyyh Let us do the summary statistics under another JIRA: https://issues.apache.org/jira/browse/SPARK-13925


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-197791173
  
    **[Test build #53410 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53410/consoleFull)** for PR 11549 at commit [`8b3dd3e`](https://github.com/apache/spark/commit/8b3dd3eebd808bc2c8e139f55e20de9d54b5cda2).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-197190040
  
    @hhbyyh #11694 has been merged, so we can provide all R-like summary statistic for ```glm```. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-197774045
  
    **[Test build #53410 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53410/consoleFull)** for PR 11549 at commit [`8b3dd3e`](https://github.com/apache/spark/commit/8b3dd3eebd808bc2c8e139f55e20de9d54b5cda2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by hhbyyh <gi...@git.apache.org>.

Github user hhbyyh closed the pull request at:

    https://github.com/apache/spark/pull/11549


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11549#issuecomment-205472279
  
    Ping!  Let me know if I can help get this in for 2.0


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org