
Posted to reviews@spark.apache.org by sethah <gi...@git.apache.org> on 2016/05/16 18:38:51 UTC

[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

GitHub user sethah opened a pull request:

    https://github.com/apache/spark/pull/13139

    [SPARK-15186][ML][DOCS] Add user guide for generalized linear regression

    ## What changes were proposed in this pull request?
    
    This patch adds a user guide section for generalized linear regression and includes the examples from [#12754](https://github.com/apache/spark/pull/12754).
    
    ## How was this patch tested?
    
    Documentation only, no tests required.
    
    ## Approach
    
    In general, it is a bit unclear what level of detail ought to be included in the user guide since there is a lot of variability within the current user guide. I tried to give a fairly brief mathematical introduction to GLMs, and cover what types of problems they could be used for. Additionally, I included a brief blurb on the IRLS solver. The input/output columns are given in a table as is found elsewhere in the docs (though, again, these appear rather intermittently in the current docs), as well as a table providing the supported families and their link functions.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sethah/spark SPARK-15186

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13139.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13139
    
----
commit e4a1e21e4121ba8d2e8d84464b794e8bdb0b9e54
Author: sethah <se...@gmail.com>
Date:   2016-05-09T14:46:46Z

    adding GLR user guide section

commit 990128db2afb3d9e7a7481f945ab30243f09a96f
Author: sethah <se...@gmail.com>
Date:   2016-05-16T18:13:02Z

    adding columns and families to description

commit 6de88b80d43720fe9e66c48205937b8fafded8d3
Author: sethah <se...@gmail.com>
Date:   2016-05-16T18:32:22Z

    cleanup

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220437291
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64802567
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,137 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
    +Spark's `GeneralizedLinearRegression` interface
    +allows for flexible specification of GLMs which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +Currently in `spark.ml`, only a subset of the exponential family distributions are supported and they are listed
    +[below](#available-families).
    +
    +**NOTE**: Spark currently only supports up to 4096 features through its `GeneralizedLinearRegression`
    +interface, and will throw an exception if this constraint is exceeded. See the [advanced section](ml-advanced) for more details.
    + Still, for linear and logistic regression, models with an increased number of features can be trained 
    + using the `LinearRegression` and `LogisticRegression` estimators.
    +
    +The canonical form of an exponential family distribution is given as:
    +
    +$$
    +f_Y(y|\theta, \tau) = h(y, \tau)\exp{\left( \frac{\theta \cdot T(y) - A(\theta)}{d(\tau)} \right)}
    +$$
    +
    +where $\theta$ is the parameter of interest and $\tau$ is a dispersion parameter. In a GLM the response variable $Y_i$ is assumed to be drawn from an exponential family distribution:
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \tau \right)
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable $\mu_i$ by
    +
    +$$
    +\mu_i = A'(\theta_i)
    +$$
    +
    +Here, $A'(\theta_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $A' = g^{-1}$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = A'^{-1}(\mu_i) = g(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\min_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} h(y_i, \tau) \exp{\left(\frac{y_i\theta_i - A(\theta_i)}{d(\tau)}\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = A'(g^{-1}(\vec{x_i} \cdot \vec{\beta}))
    --- End diff --
    
    Good catch!



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64086909
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,148 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\min_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i} \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    +others.
    +
    +###  Available families
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th>Family</th>
    +      <th>Response Type</th>
    +      <th>Supported Links</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Gaussian</td>
    +      <td>Continuous</td>
    +      <td>Identity*, Log, Inverse</td>
    +    </tr>
    +    <tr>
    +      <td>Binomial</td>
    +      <td>Binary</td>
    +      <td>Logit*, Probit, CLogLog</td>
    +    </tr>
    +    <tr>
    +      <td>Poisson</td>
    +      <td>Count</td>
    +      <td>Log*, Identity, Sqrt</td>
    +    </tr>
    +    <tr>
    +      <td>Gamma</td>
    +      <td>Continuous</td>
    +      <td>Inverse*, Identity, Log</td>
    +    </tr>
    +    <tfoot><tr><td colspan="4">* Canonical Link</td></tr></tfoot>
    +  </tbody>
    +</table>
    +
    +### Optimization
    +
    +The `spark.ml` GLM implements the method of 
    +[iteratively reweighted least squares](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares) (IRLS) for finding
    +the optimal regression coefficients. GLMs seek to find a maximum likelihood estimate of the
    +regression coefficients by finding zeros of the [score equation](https://en.wikipedia.org/wiki/Score_(statistics)).
    +The method of IRLS uses a first-order Taylor approximation of the score equation in the vicinity of an initial guess for the expected response
    + $\vec{\mu}$. This approximation can be manipulated to the form of a simple weighted least squares regression, which is straightforward
    + to solve using a normal equation solver. Solving this initial weighted least squares problem yields a (likely poor) approximation
    + to the regression coefficients $\vec{\beta}$. However, this approximation of $\vec{\beta}$ generates an improved approximation for $\vec{\mu}$
    + using the fact that $\vec{\mu} = g^{-1}(X\vec{\beta})$. In turn, an even more improved approximation to $\vec{\beta}$ can be found
    + solving the weighted least squares problem again. The true value of $\vec{\beta}$ is converged upon by repeatedly solving weighted least
    + squares problems in this manner (hence the name, iteratively reweighted least squares).
    +
    + Note that solving the normal equations, as in a weighted least squares, for a linear system $A\vec{x} = \vec{b}$ involves 
    + inverting the covariance matrix $A^TA$. If $A$ is an $MxN$ matrix, then $A^TA$ has dimension $NxN$. When N is relatively
    + small (< 4096) then the covariance matrix can (generally) fit into main memory on the driver node and the linear system can
    + then be solved using well-established linear subroutines like the Cholesky decomposition. For this reason, it is important
    + to note that the `spark.ml` generalized linear regression module currently does not accept more than 4096 feature columns.
    +
    +**Example**
    +
    +The following example demonstrates training a GLM with a Gaussian response and identity link
    +function and extracting model summary statistics.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/GeneralizedLinearRegressionExample.scala %}
    --- End diff --
    
    I am having an error with jekyll here, can someone verify?
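
For readers following the Optimization paragraph quoted in the hunk above, the IRLS loop it describes can be sketched in a few lines. This is a minimal illustration only, not Spark's implementation: it assumes a Poisson response with the canonical log link, a single feature plus an intercept, and a hand-written 2x2 normal-equation solve. The function names `solve_2x2` and `irls_poisson` are hypothetical.

```python
import math


def solve_2x2(a, b):
    """Solve the 2x2 linear system a * x = b using Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return [x0, x1]


def irls_poisson(xs, ys, iterations=25, tol=1e-10):
    """Fit log(mu_i) = b0 + b1 * x_i by iteratively reweighted least squares."""
    # Initial guess: intercept-only model, mu_i = mean(y) for every row.
    beta = [math.log(sum(ys) / len(ys)), 0.0]
    for _ in range(iterations):
        xtwx = [[0.0, 0.0], [0.0, 0.0]]  # accumulates X^T W X
        xtwz = [0.0, 0.0]                # accumulates X^T W z
        for x, y in zip(xs, ys):
            eta = beta[0] + beta[1] * x  # linear predictor
            mu = math.exp(eta)           # mu = g^{-1}(eta) for the log link
            w = mu                       # Poisson/log weight: (dmu/deta)^2 / Var(y) = mu
            z = eta + (y - mu) / mu      # working response
            row = (1.0, x)
            for i in range(2):
                xtwz[i] += row[i] * w * z
                for j in range(2):
                    xtwx[i][j] += row[i] * w * row[j]
        # Each iteration is one weighted least squares solve of the normal equations.
        new_beta = solve_2x2(xtwx, xtwz)
        delta = max(abs(n - o) for n, o in zip(new_beta, beta))
        beta = new_beta
        if delta < tol:
            break
    return beta


if __name__ == "__main__":
    # y = 2^x exactly, so the maximum likelihood fit is b0 = 0, b1 = log(2).
    print(irls_poisson([0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 4.0, 8.0, 16.0]))
```

With N features the matrix accumulated in `xtwx` is N x N, which is the object the quoted text says must fit in driver memory.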



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220229727
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58842/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221421289
  
    @jkbradley I updated the notation to fall in line with Wikipedia. 



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220698289
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59016/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64086020
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    --- End diff --
    
    You should put this paragraph first (it explains the purpose of GLR), and mention that all the supported families are listed below.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221167413
  
    @yanboliang @sethah Could you please reconcile this PR with [https://github.com/apache/spark/pull/13262]?  Either option is OK with me.  If I had to choose, I'd put the optimization stuff in ml-advanced since most users will not need to know it.
    
    @sethah Where are you drawing your notation from?  If it's a source online, could you link to it?



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64790223
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,137 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
    +Spark's `GeneralizedLinearRegression` interface
    +allows for flexible specification of GLMs which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +Currently in `spark.ml`, only a subset of the exponential family distributions are supported and they are listed
    +[below](#available-families).
    +
    +**NOTE**: Spark currently only supports up to 4096 features through its `GeneralizedLinearRegression`
    +interface, and will throw an exception if this constraint is exceeded. See the [advanced section](ml-advanced) for more details.
    + Still, for linear and logistic regression, models with an increased number of features can be trained 
    + using the `LinearRegression` and `LogisticRegression` estimators.
    +
    +The canonical form of an exponential family distribution is given as:
    +
    +$$
    +f_Y(y|\theta, \tau) = h(y, \tau)\exp{\left( \frac{\theta \cdot T(y) - A(\theta)}{d(\tau)} \right)}
    +$$
    +
    +where $\theta$ is the parameter of interest and $\tau$ is a dispersion parameter. In a GLM the response variable $Y_i$ is assumed to be drawn from an exponential family distribution:
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \tau \right)
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable $\mu_i$ by
    +
    +$$
    +\mu_i = A'(\theta_i)
    +$$
    +
    +Here, $A'(\theta_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $A' = g^{-1}$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = A'^{-1}(\mu_i) = g(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\min_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    --- End diff --
    
    max
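
To spell out the one-word comment: the sentence in the hunk says the coefficients maximize the likelihood, so the operator on the quoted line should be a maximum. A sketch of the corrected objective, reusing the symbols of the quoted revision ($h$, $A$, $d(\tau)$) and assuming $T(y) = y$ as in the patch's own likelihood, together with the equivalent minimization of the negative log-likelihood:

```latex
\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
\prod_{i=1}^{N} h(y_i, \tau) \exp{\left(\frac{y_i\theta_i - A(\theta_i)}{d(\tau)}\right)}

\min_{\vec{\beta}} -\log \mathcal{L}(\vec{\theta}|\vec{y},X) =
-\sum_{i=1}^{N} \left( \frac{y_i\theta_i - A(\theta_i)}{d(\tau)} + \log h(y_i, \tau) \right)
```

The two problems have the same solution because the logarithm is strictly increasing; $\min$ is correct only when paired with the negated log-likelihood.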



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220437188
  
    **[Test build #58897 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58897/consoleFull)** for PR 13139 at commit [`d828909`](https://github.com/apache/spark/commit/d828909c850490c1f6c87f545fa40c3cb51649f3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64811972
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,137 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
    +Spark's `GeneralizedLinearRegression` interface
    +allows for flexible specification of GLMs which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +Currently in `spark.ml`, only a subset of the exponential family distributions are supported and they are listed
    +[below](#available-families).
    +
    +**NOTE**: Spark currently only supports up to 4096 features through its `GeneralizedLinearRegression`
    +interface, and will throw an exception if this constraint is exceeded. See the [advanced section](ml-advanced) for more details.
    + Still, for linear and logistic regression, models with an increased number of features can be trained 
    + using the `LinearRegression` and `LogisticRegression` estimators.
    +
    +The canonical form of an exponential family distribution is given as:
    +
    +$$
    +f_Y(y|\theta, \tau) = h(y, \tau)\exp{\left( \frac{\theta \cdot T(y) - A(\theta)}{d(\tau)} \right)}
    --- End diff --
    
    Update: I reworded it. GLMs require distributions from the "natural exponential family," which is a subset of the exponential family. See [here](http://www.stats.ox.ac.uk/~steffen/teaching/bs2HT9/glim.pdf) and [here](http://www.biostat.umn.edu/~dipankar/bmtry711.11/lecture_11.pdf).
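
As a concrete instance of the form quoted in the hunk, the Poisson distribution can be written with $T(y) = y$. The worked example below is illustrative and not part of the patch; it uses the hunk's symbols and takes $d(\tau) = 1$ for the Poisson case.

```latex
P(y|\lambda) = \frac{\lambda^{y} e^{-\lambda}}{y!}
             = \frac{1}{y!} \exp{\left( y \log\lambda - \lambda \right)}

\theta = \log\lambda, \quad T(y) = y, \quad A(\theta) = e^{\theta}, \quad
h(y, \tau) = \frac{1}{y!}, \quad d(\tau) = 1

\mu = A'(\theta) = e^{\theta} = \lambda
\quad\Longrightarrow\quad
\theta = \log\mu
```

Setting $\theta = \eta$ therefore gives $g(\mu) = \log\mu$, the canonical link for the Poisson family.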



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221423564
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59227/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221423561
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64328611
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,154 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
    +Spark's `GeneralizedLinearRegression` interface
    +allows for flexible specification of GLMs which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +Currently in `spark.ml`, only a subset of the exponential family distributions are supported and they are listed
    +[below](#available-families).
    +
    +**NOTE**: Spark currently only supports up to 4096 features for GLM models, and will throw an exception if this 
    +constraint is exceeded. See the [optimization section](#optimization) for more details.
    +
    +In a GLM the response variable $Y_i$ is assumed to be drawn from an exponential family distribution:
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    --- End diff --
    
    phi should be defined.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63522591
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so-called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i} \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    +others.
    +
    +###  Available families
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th></th>
    +      <th>PDF</th>
    +      <th>Response Type</th>
    +      <th>Supported Links</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Gaussian</td>
    +      <td>$\frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{(x - \mu)^2}{2\sigma^2}\right)$</td>
    +      <td>Continuous</td>
    +      <td>Identity*, Log, Inverse</td>
    +    </tr>
    +    <tr>
    +      <td>Binomial</td>
    +      <td>$\binom{n}{k}p^k (1-p)^{n-k}$</td>
    +      <td>Binary</td>
    +      <td>Logit*, Probit, CLogLog</td>
    +    </tr>
    +    <tr>
    +      <td>Poisson</td>
    +      <td>$\frac{\lambda^k e^{-\lambda}}{k!}$</td>
    +      <td>Count</td>
    +      <td>Log*, Identity, Sqrt</td>
    +    </tr>
    +    <tr>
    +      <td>Gamma</td>
    +      <td>$\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$</td>
    +      <td>Continuous</td>
    +      <td>Inverse*, Identity, Log</td>
    +    </tr>
    +    <tfoot><tr><td colspan="4">* Canonical Link</td></tr></tfoot>
    +  </tbody>
    +</table>
    +
    +### Optimization
    +
    +The `spark.ml` GLM implements the method of 
    +[iteratively reweighted least squares](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares) (IRLS) for finding
    +the optimal regression coefficients. GLMs seek to find a maximum likelihood estimate of the
    +regression coefficients by finding zeros of the [score equation](https://en.wikipedia.org/wiki/Score_(statistics)). 
    +The IRLS solver casts a first-order Taylor approximation of the score equation to a weighted least squares regression and solves it
    +iteratively until convergence.
    +
    +### Input Columns
    +
    +<table class="table">
    --- End diff --
    
    Yeah this is a good point. I think we should come up with a clear standard on what goes into the user guide. Do we include input/output columns? parameters?
    
    The MLlib guide often included more detail on parameters in many cases. You've suggested elsewhere that including these makes it more difficult to maintain to ensure the params (and defaults etc) in the code match the user guide - and I tend to agree. These rather live in the API docs (to which we typically link in the ml guide).
    
    I'd probably say that if we go with the "check the API docs for all param details" then we should not bother with documenting the various input/output columns in the user guide, for the same reasons of maintainability.
    
    cc @jkbradley @mengxr @yanboliang @srowen for comment
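
The canonical links starred in the Available families table quoted above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the names `CANONICAL_LINKS` and `predict_mean` are hypothetical and are not part of Spark's API, and the Gamma entry uses the conventional inverse link $1/\mu$ as listed in the table.

```python
import math

# Each entry maps a family name to (link g, inverse link g^{-1}).
# Hypothetical helper table for illustration; not Spark's implementation.
CANONICAL_LINKS = {
    # Identity link: eta = mu
    "gaussian": (lambda mu: mu, lambda eta: eta),
    # Logit link: eta = log(mu / (1 - mu))
    "binomial": (lambda mu: math.log(mu / (1.0 - mu)),
                 lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    # Log link: eta = log(mu)
    "poisson": (math.log, math.exp),
    # Inverse link: eta = 1 / mu
    "gamma": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
}


def predict_mean(family, x, beta):
    """Return mu = g^{-1}(x . beta) for the family's canonical link."""
    _, inverse_link = CANONICAL_LINKS[family]
    eta = sum(xj * bj for xj, bj in zip(x, beta))  # linear predictor
    return inverse_link(eta)
```

With all coefficients set to zero the linear predictor is 0, so `predict_mean("poisson", [1.0, 2.0], [0.0, 0.0])` evaluates the inverse log link at 0 and `predict_mean("binomial", [1.0], [0.0])` evaluates the inverse logit at 0.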



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221367415
  
    The only difference here and the reference I linked to (I think) is that I replaced `a_i(\theta)` with its typical form, which they mention, `\phi / w_i`. I think matching Wikipedia is a good idea. I will work on translating the notation. 
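
A hedged sketch of the translation mentioned above, assuming the Wikipedia article's overdispersed exponential family form with $\mathbf{b}(\theta) = \theta$ and $T(y) = y$ (the Wikipedia symbol names here are an assumption, not taken from the patch):

```latex
\begin{aligned}
\text{patch notation:}\quad
  & f(y \mid \theta, \phi, w)
    = \exp\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right) \\
\text{Wikipedia-style notation:}\quad
  & f_Y(y \mid \theta, \tau)
    = h(y, \tau)\,\exp\left(\frac{\theta y - A(\theta)}{d(\tau)}\right) \\
\text{correspondence:}\quad
  & A(\theta) \leftrightarrow b(\theta), \qquad
    d(\tau) \leftrightarrow \phi/w, \qquad
    h(y, \tau) \leftrightarrow \exp\left(-c(y, \phi)\right)
\end{aligned}
```

Under this correspondence the two forms describe the same density; only the symbol names change.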



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63690666
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so-called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i} \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    +others.
    +
    +###  Available families
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th></th>
    +      <th>PDF</th>
    +      <th>Response Type</th>
    +      <th>Supported Links</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Gaussian</td>
    +      <td>$\frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{(x - \mu)^2}{2\sigma^2}\right)$</td>
    +      <td>Continuous</td>
    +      <td>Identity*, Log, Inverse</td>
    +    </tr>
    +    <tr>
    +      <td>Binomial</td>
    +      <td>$\binom{n}{k}p^k (1-p)^{n-k}$</td>
    +      <td>Binary</td>
    +      <td>Logit*, Probit, CLogLog</td>
    +    </tr>
    +    <tr>
    +      <td>Poisson</td>
    +      <td>$\frac{\lambda^k e^{-\lambda}}{k!}$</td>
    +      <td>Count</td>
    +      <td>Log*, Identity, Sqrt</td>
    +    </tr>
    +    <tr>
    +      <td>Gamma</td>
    +      <td>$\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$</td>
    +      <td>Continuous</td>
    +      <td>Inverse*, Identity, Log</td>
    +    </tr>
    +    <tfoot><tr><td colspan="4">* Canonical Link</td></tr></tfoot>
    +  </tbody>
    +</table>
    +
    +### Optimization
    +
    +The `spark.ml` GLM implements the method of 
    +[iteratively reweighted least squares](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares) (IRLS) for finding
    +the optimal regression coefficients. GLMs seek to find a maximum likelihood estimate of the
    +regression coefficients by finding zeros of the [score equation](https://en.wikipedia.org/wiki/Score_(statistics)). 
    +The IRLS solver casts a first-order Taylor approximation of the score equation to a weighted least squares regression and solves it
    +iteratively until convergence.
    +
    +### Input Columns
    +
    +<table class="table">
    --- End diff --
    
    I'd prefer to not include params here and provide link to the API docs.
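
The IRLS procedure described in the Optimization section quoted above can be sketched in plain Python for one concrete case, a Poisson family with the log link. This is a minimal sketch under stated assumptions, not Spark's solver: the function names are hypothetical, the starting guess `y + 0.1` for the expected response is an assumption made to keep the logarithm defined, and the small dense system is solved by Gaussian elimination rather than a Cholesky decomposition.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def irls_poisson(xs, ys, max_iter=50, tol=1e-10):
    """Fit a Poisson GLM with log link by IRLS (illustrative sketch).

    xs: list of feature rows (include a leading 1.0 for an intercept).
    ys: list of non-negative responses.
    """
    p = len(xs[0])
    # Assumed initial guess for the expected response mu.
    mus = [y + 0.1 for y in ys]
    beta = None
    for _ in range(max_iter):
        xtwx = [[0.0] * p for _ in range(p)]
        xtwz = [0.0] * p
        for x, y, mu in zip(xs, ys, mus):
            eta = math.log(mu)          # linear predictor g(mu)
            z = eta + (y - mu) / mu     # working response
            w = mu                      # working weight for Poisson/log
            for j in range(p):
                xtwz[j] += w * x[j] * z
                for k in range(p):
                    xtwx[j][k] += w * x[j] * x[k]
        # Weighted least squares step: (X^T W X) beta = X^T W z.
        new_beta = solve_linear_system(xtwx, xtwz)
        # Improved guess for mu from the new coefficients.
        mus = [math.exp(sum(xj * bj for xj, bj in zip(x, new_beta)))
               for x in xs]
        converged = beta is not None and max(
            abs(nb - b) for nb, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta
```

Calling `irls_poisson([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 2.0, 4.0])` returns an intercept and a slope. Because the log link is canonical for the Poisson family, the fitted means at convergence satisfy the score equation $X^T(\vec{y} - \vec{\mu}) = 0$.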



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63943165
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so-called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i} \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    +others.
    +
    +###  Available families
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th></th>
    +      <th>PDF</th>
    --- End diff --
    
    Removed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64248278
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    --- End diff --
    
    Ok great, thanks.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220437293
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58897/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64328639
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,154 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
    +Spark's `GeneralizedLinearRegression` interface
    +allows for flexible specification of GLMs which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +Currently in `spark.ml`, only a subset of the exponential family distributions are supported and they are listed
    +[below](#available-families).
    +
    +**NOTE**: Spark currently only supports up to 4096 features for GLM models, and will throw an exception if this 
    +constraint is exceeded. See the [optimization section](#optimization) for more details.
    +
    +In a GLM the response variable $Y_i$ is assumed to be drawn from an exponential family distribution:
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    --- End diff --
    
    Same for any other notation



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221967252
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-219511959
  
    **[Test build #58653 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58653/consoleFull)** for PR 13139 at commit [`6de88b8`](https://github.com/apache/spark/commit/6de88b80d43720fe9e66c48205937b8fafded8d3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63890670
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so-called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i} \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    --- End diff --
    
    I would prefer not to. Summary statistics are relatively easy to add and so this could change rather frequently. We should let the examples document them.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220229725
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64779485
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,148 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so-called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i} \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    +others.
    +
    +###  Available families
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th>Family</th>
    +      <th>Response Type</th>
    +      <th>Supported Links</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Gaussian</td>
    +      <td>Continuous</td>
    +      <td>Identity*, Log, Inverse</td>
    +    </tr>
    +    <tr>
    +      <td>Binomial</td>
    +      <td>Binary</td>
    +      <td>Logit*, Probit, CLogLog</td>
    +    </tr>
    +    <tr>
    +      <td>Poisson</td>
    +      <td>Count</td>
    +      <td>Log*, Identity, Sqrt</td>
    +    </tr>
    +    <tr>
    +      <td>Gamma</td>
    +      <td>Continuous</td>
    +      <td>Inverse*, Identity, Log</td>
    +    </tr>
    +  </tbody>
    +  <tfoot><tr><td colspan="3">* Canonical Link</td></tr></tfoot>
    +</table>
    +
    +### Optimization
    +
    +The `spark.ml` GLM implements the method of 
    +[iteratively reweighted least squares](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares) (IRLS) for finding
    +the optimal regression coefficients. GLMs seek to find a maximum likelihood estimate of the
    +regression coefficients by finding zeros of the [score equation](https://en.wikipedia.org/wiki/Score_(statistics)).
    +The method of IRLS uses a first-order Taylor approximation of the score equation in the vicinity of an initial guess for the expected response
    +$\vec{\mu}$. This approximation can be rearranged into a weighted least squares regression, which is straightforward
    +to solve using a normal equation solver. Solving this initial weighted least squares problem yields a (likely poor) approximation
    +to the regression coefficients $\vec{\beta}$. This approximation of $\vec{\beta}$ in turn yields an improved approximation of $\vec{\mu}$,
    +since $\vec{\mu} = g^{-1}(X\vec{\beta})$. A better approximation of $\vec{\beta}$ can then be found by
    +solving the weighted least squares problem again. Repeating this process until the estimate of $\vec{\beta}$ stops changing
    +gives the method its name, iteratively reweighted least squares.
    +
    +Note that solving the normal equations of a weighted least squares problem for a linear system $A\vec{x} = \vec{b}$ involves
    +inverting the Gram matrix $A^TA$. If $A$ is an $M \times N$ matrix, then $A^TA$ has dimension $N \times N$. When $N$ is relatively
    +small (at most 4096), $A^TA$ can (generally) fit into main memory on the driver node and the linear system can
    +then be solved using well-established linear algebra routines like the Cholesky decomposition. For this reason, it is important
    +to note that the `spark.ml` generalized linear regression module currently does not accept more than 4096 feature columns.
    +
    +**Example**
    +
    +The following example demonstrates training a GLM with a Gaussian response and identity link
    +function and extracting model summary statistics.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/GeneralizedLinearRegressionExample.scala %}
    --- End diff --
    
    Looks like it is working now
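
To make the Optimization text quoted above concrete, here is a minimal sketch of the IRLS loop in plain Python for one specific case: a Poisson response with the log link. It is an illustration under stated assumptions, not Spark's implementation. The names `solve` and `irls_poisson`, the starting guess for the mean, and the stopping rule are all hypothetical choices made for this sketch.

```python
import math


def solve(a, b):
    """Solve the small dense system a x = b by Gaussian elimination
    with partial pivoting (stand-in for a Cholesky-based solver)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def irls_poisson(xs, ys, max_iter=25, tol=1e-8):
    """Fit a Poisson GLM with log link by IRLS (illustrative only).

    xs is a list of feature rows, ys the list of responses.
    """
    p = len(xs[0])
    # Assumed starting guess: the response itself, nudged away from zero
    # so that log(mu) is defined.
    mu = [y + 0.1 for y in ys]
    eta = [math.log(v) for v in mu]
    beta = [0.0] * p
    for _ in range(max_iter):
        # Working response z = eta + (y - mu) * g'(mu), with g'(mu) = 1 / mu.
        z = [e + (y - v) / v for e, y, v in zip(eta, ys, mu)]
        # Working weights 1 / (Var(mu) * g'(mu)^2) = mu for Poisson with log link.
        w = mu
        # Normal equations of the weighted least squares problem:
        # (X^T W X) beta = X^T W z. The left-hand matrix is p x p.
        xtwx = [[sum(wi * r[j] * r[k] for wi, r in zip(w, xs))
                 for k in range(p)] for j in range(p)]
        xtwz = [sum(wi * r[j] * zi for wi, r, zi in zip(w, xs, z))
                for j in range(p)]
        new_beta = solve(xtwx, xtwz)
        delta = max(abs(a - b) for a, b in zip(new_beta, beta))
        beta = new_beta
        # Refresh the linear predictor and the mean: mu = g^{-1}(X beta).
        eta = [sum(b * v for b, v in zip(beta, r)) for r in xs]
        mu = [math.exp(e) for e in eta]
        if delta < tol:
            break
    return beta
```

Passing rows with a leading column of ones fits an intercept. Each pass builds the `p x p` system `X^T W X`, which is the matrix whose size the 4096-feature limit is about.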


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221979783
  
    **[Test build #59410 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59410/consoleFull)** for PR 13139 at commit [`6e7ddd3`](https://github.com/apache/spark/commit/6e7ddd3f9e91b11979ab704438851ea4d8d99f67).



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220698288
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-219512077
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58653/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64790218
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,137 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
    +Spark's `GeneralizedLinearRegression` interface
    +allows for flexible specification of GLMs which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +Currently in `spark.ml`, only a subset of the exponential family distributions are supported and they are listed
    +[below](#available-families).
    +
    +**NOTE**: Spark currently only supports up to 4096 features through its `GeneralizedLinearRegression`
    +interface, and will throw an exception if this constraint is exceeded. See the [advanced section](ml-advanced) for more details.
    + Still, for linear and logistic regression, models with an increased number of features can be trained 
    + using the `LinearRegression` and `LogisticRegression` estimators.
    +
    +The canonical form of an exponential family distribution is given as:
    +
    +$$
    +f_Y(y|\theta, \tau) = h(y, \tau)\exp{\left( \frac{\theta \cdot T(y) - A(\theta)}{d(\tau)} \right)}
    --- End diff --
    
    T is not defined (and is discarded below when you mention max likelihood)
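
One concrete reading of the comment above, offered as a worked example rather than as what the PR author intended: if the sufficient statistic is assumed to be $T(y) = y$ (an assumption made here, not something the quoted diff states), the quoted density can be matched against a familiar distribution. For a Poisson response with mean $\lambda$:

$$
f_Y(y|\lambda) = \frac{\lambda^y e^{-\lambda}}{y!} = \frac{1}{y!}\exp{\left( y \log\lambda - \lambda \right)}
$$

This fits the quoted canonical form with $T(y) = y$, $\theta = \log\lambda$, $A(\theta) = e^{\theta}$, $d(\tau) = 1$, and $h(y, \tau) = 1/y!$. With that choice $T$ no longer needs a symbol of its own, which may be why it does not appear in the later likelihood expression.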



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220228793
  
    **[Test build #58842 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58842/consoleFull)** for PR 13139 at commit [`ce7c55e`](https://github.com/apache/spark/commit/ce7c55e14a76dc85bca51a2563d770e3eac3a2a2).



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64116968
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    --- End diff --
    
    I changed it up a bit. Let me know if that flows better.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220434968
  
    **[Test build #58897 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58897/consoleFull)** for PR 13139 at commit [`d828909`](https://github.com/apache/spark/commit/d828909c850490c1f6c87f545fa40c3cb51649f3).



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63521258
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so-called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i}^T \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    +others.
    +
    +###  Available families
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th></th>
    +      <th>PDF</th>
    +      <th>Response Type</th>
    +      <th>Supported Links</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Gaussian</td>
    +      <td>$\frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{(x - \mu)^2}{2\sigma^2}\right)$</td>
    +      <td>Continuous</td>
    +      <td>Identity*, Log, Inverse</td>
    +    </tr>
    +    <tr>
    +      <td>Binomial</td>
    +      <td>$\binom{n}{k}p^k (1-p)^{n-k}$</td>
    +      <td>Binary</td>
    +      <td>Logit*, Probit, CLogLog</td>
    +    </tr>
    +    <tr>
    +      <td>Poisson</td>
    +      <td>$\frac{\lambda^k e^{-\lambda}}{k!}$</td>
    +      <td>Count</td>
    +      <td>Log*, Identity, Sqrt</td>
    +    </tr>
    +    <tr>
    +      <td>Gamma</td>
    +      <td>$\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$</td>
    +      <td>Continuous</td>
    +      <td>Inverse*, Identity, Log</td>
    +    </tr>
    +  </tbody>
    +  <tfoot><tr><td colspan="4">* Canonical Link</td></tr></tfoot>
    +</table>
    +
    +### Optimization
    +
    +The `spark.ml` GLM implements the method of 
    +[iteratively reweighted least squares](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares) (IRLS) for finding
    +the optimal regression coefficients. GLMs seek to find a maximum likelihood estimate of the
    +regression coefficients by finding zeros of the [score equation](https://en.wikipedia.org/wiki/Score_(statistics)). 
    +The IRLS solver casts a first-order Taylor approximation of the score equation to a weighted least squares regression and solves it
    +iteratively until convergence.
    +
    +### Input Columns
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th align="left">Param name</th>
    +      <th align="left">Type(s)</th>
    +      <th align="left">Default</th>
    +      <th align="left">Description</th>
    +    </tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>labelCol</td>
    +      <td>Double</td>
    +      <td>"label"</td>
    +      <td>Label to predict</td>
    +    </tr>
    +    <tr>
    +      <td>featuresCol</td>
    +      <td>Vector</td>
    +      <td>"features"</td>
    +      <td>Feature vector</td>
    +    </tr>
    +    <tr>
    +      <td>weightCol</td>
    +      <td>Double</td>
    +      <td>""</td>
    +      <td>Sample weights</td>
    +    </tr>
    +  </tbody>
    +</table>
    +
    +### Output Columns
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th align="left">Param name</th>
    +      <th align="left">Type(s)</th>
    +      <th align="left">Default</th>
    +      <th align="left">Description</th>
    +    </tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>predictionCol</td>
    +      <td>Double</td>
    +      <td>"prediction"</td>
    +      <td>Predicted label</td>
    +    </tr>
    +    <tr>
    +      <td>linkPredictionCol</td>
    +      <td>Double</td>
    +      <td>""</td>
    +      <td>Linear predicted response</td>
    +    </tr>
    +  </tbody>
    +</table>
    +
    +
    +**Example**
    +
    +The following example demonstrates training a GLM with a Gaussian response and identity link
    +function and extracting model summary statistics.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/GeneralizedLinearRegressionExample.scala %}
    --- End diff --
    
    Other examples usually include a link to the API docs also in this section



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63689827
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so-called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i}^T \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    --- End diff --
    
    Should we list all statistic summary here to let users know?
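
For readers following this thread, the plain Python sketch below gives textbook definitions of two of the statistics named in the quoted text, the deviance and the Akaike information criterion, for a Poisson model. The function names are hypothetical and the formulas are the standard ones. Whether Spark's summary computes them in exactly this way is not confirmed here.

```python
import math


def poisson_deviance(ys, mus):
    """Deviance 2 * sum(y * log(y / mu) - (y - mu)) for observed counts ys
    and fitted means mus; the y * log(y / mu) term is taken as 0 when y == 0."""
    total = 0.0
    for y, mu in zip(ys, mus):
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += term - (y - mu)
    return 2.0 * total


def poisson_aic(ys, mus, num_params):
    """AIC = -2 * log-likelihood + 2 * (number of fitted parameters)."""
    loglik = sum(y * math.log(mu) - mu - math.lgamma(y + 1)
                 for y, mu in zip(ys, mus))
    return -2.0 * loglik + 2.0 * num_params
```

A fit that reproduces every observation exactly has zero deviance, so larger values indicate a worse fit. AIC adds a penalty of two per parameter, which makes it usable for comparing models of different size on the same data.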



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-222227527
  
    @sethah Thanks for the careful checks & writing.  After all this, perhaps it was a pretty tall order to write a brief explanation of GLMs!
    
    LGTM pending tests, which I'll rerun now



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-222227619
  
    **[Test build #3025 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3025/consoleFull)** for PR 13139 at commit [`6e7ddd3`](https://github.com/apache/spark/commit/6e7ddd3f9e91b11979ab704438851ea4d8d99f67).



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63520885
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so-called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i}^T \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    +others.
    +
    +###  Available families
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th></th>
    +      <th>PDF</th>
    +      <th>Response Type</th>
    +      <th>Supported Links</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Gaussian</td>
    +      <td>$\frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{(x - \mu)^2}{2\sigma^2}\right)$</td>
    +      <td>Continuous</td>
    +      <td>Identity*, Log, Inverse</td>
    +    </tr>
    +    <tr>
    +      <td>Binomial</td>
    +      <td>$\binom{n}{k}p^k (1-p)^{n-k}$</td>
    +      <td>Binary</td>
    +      <td>Logit*, Probit, CLogLog</td>
    +    </tr>
    +    <tr>
    +      <td>Poisson</td>
    +      <td>$\frac{\lambda^k e^{-\lambda}}{k!}$</td>
    +      <td>Count</td>
    +      <td>Log*, Identity, Sqrt</td>
    +    </tr>
    +    <tr>
    +      <td>Gamma</td>
    +      <td>$\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$</td>
    +      <td>Continuous</td>
    +      <td>Inverse*, Identity, Log</td>
    +    </tr>
    +  </tbody>
    +  <tfoot><tr><td colspan="4">* Canonical Link</td></tr></tfoot>
    +</table>
    +
    +### Optimization
    --- End diff --
    
    I think in this section we should reiterate that it only works for lower feature dimensions (<4096). We've mentioned that above, but I think it should be made clear _where_ the constraint applies (i.e. it is actually in the solver).
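
A back-of-the-envelope check illustrates why a limit of this kind belongs to the solver. The sketch below assumes the solver holds a dense `N x N` matrix of 8-byte doubles on the driver. That storage layout is an assumption made for this illustration (an implementation may well store only one triangle), and `gram_matrix_bytes` is a hypothetical helper name.

```python
BYTES_PER_DOUBLE = 8


def gram_matrix_bytes(num_features):
    """Bytes needed for a dense num_features x num_features matrix of doubles."""
    return num_features * num_features * BYTES_PER_DOUBLE


# Print the footprint in MiB for a few feature counts.
for n in (1024, 4096, 100_000):
    mib = gram_matrix_bytes(n) / 2**20
    print(f"{n:>7} features -> {mib:,.0f} MiB")
```

Under this assumption 4096 features need 128 MiB. The footprint grows with the square of the feature count, so it depends on the number of columns and not on the number of rows.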


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63689657
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i} \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    +others.
    +
    +###  Available families
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th></th>
    +      <th>PDF</th>
    +      <th>Response Type</th>
    +      <th>Supported Links</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Gaussian</td>
    +      <td>$\frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{(x - \mu)^2}{2\sigma^2}\right)$</td>
    +      <td>Continuous</td>
    +      <td>Identity*, Log, Inverse</td>
    +    </tr>
    +    <tr>
    +      <td>Binomial</td>
    +      <td>$\binom{n}{k}p^k (1-p)^{n-k}$</td>
    +      <td>Binary</td>
    +      <td>Logit*, Probit, CLogLog</td>
    +    </tr>
    +    <tr>
    +      <td>Poisson</td>
    +      <td>$\frac{\lambda^k e^{-\lambda}}{k!}$</td>
    +      <td>Count</td>
    +      <td>Log*, Identity, Sqrt</td>
    +    </tr>
    +    <tr>
    +      <td>Gamma</td>
    +      <td>$\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$</td>
    +      <td>Continuous</td>
    +      <td>Inverse*, Identity, Log</td>
    +    </tr>
    +    <tfoot><tr><td colspan="4">* Canonical Link</td></tr></tfoot>
    +  </tbody>
    +</table>
    +
    +### Optimization
    --- End diff --
    
    Yeah. I also think we should describe IRLS in more detail, for example that each IRLS step solves a weighted least squares (WLS) problem using the normal equations method. I can contribute a separate PR that documents the IRLS and WLS solvers in Spark; then feel free to refer to that directly.
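
To make the "each IRLS step is a WLS solve" point concrete, here is a minimal pure-Python sketch. It assumes a Poisson family with the log link and dense in-memory data; the function names `solve_linear_system` and `irls_poisson` are hypothetical and this is not Spark's implementation, only an illustration of the idea.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def irls_poisson(xs, ys, iterations=25):
    """Fit a Poisson GLM with log link; every iteration is one WLS solve."""
    p = len(xs[0])
    beta = [0.0] * p
    for _ in range(iterations):
        eta = [sum(xj * bj for xj, bj in zip(x, beta)) for x in xs]
        mu = [math.exp(e) for e in eta]
        # Log link: weights w_i = mu_i, working response z_i = eta_i + (y_i - mu_i) / mu_i.
        z = [e + (y - mi) / mi for e, y, mi in zip(eta, ys, mu)]
        # Normal equations of the weighted problem: (X^T W X) beta = X^T W z.
        xtwx = [[sum(w * x[j] * x[k] for x, w in zip(xs, mu)) for k in range(p)]
                for j in range(p)]
        xtwz = [sum(w * x[j] * zi for x, w, zi in zip(xs, mu, z)) for j in range(p)]
        beta = solve_linear_system(xtwx, xtwz)
    return beta


# Usage: rows are [intercept, feature]; responses follow exp(0.5 + 0.3 * feature).
xs = [[1.0, float(i)] for i in range(4)]
ys = [math.exp(0.5 + 0.3 * x[1]) for x in xs]
beta = irls_poisson(xs, ys)
```

Forming and solving the normal equations costs roughly cubic time in the number of features, which is one plausible reason to document the feature limit alongside the solver.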



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221423464
  
    **[Test build #59227 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59227/consoleFull)** for PR 13139 at commit [`e5eb583`](https://github.com/apache/spark/commit/e5eb583721ea9c560e87186bf9bbd1e13e527c84).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63688193
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    --- End diff --
    
    Or document the supported distributions here.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221981865
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59410/
    Test FAILed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64327019
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,154 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
    +Spark's `GeneralizedLinearRegression` interface
    +allows for flexible specification of GLMs which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +Currently in `spark.ml`, only a subset of the exponential family distributions are supported and they are listed
    +[below](#available-families).
    +
    +**NOTE**: Spark currently only supports up to 4096 features for GLM models, and will throw an exception if this 
    +constraint is exceeded. See the [optimization section](#optimization) for more details.
    --- End diff --
    
    Note that, for certain models, you can call LinearRegression or LogisticRegression to use other solvers which support more features.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220229641
  
    **[Test build #58842 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58842/consoleFull)** for PR 13139 at commit [`ce7c55e`](https://github.com/apache/spark/commit/ce7c55e14a76dc85bca51a2563d770e3eac3a2a2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221421942
  
    **[Test build #59227 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59227/consoleFull)** for PR 13139 at commit [`e5eb583`](https://github.com/apache/spark/commit/e5eb583721ea9c560e87186bf9bbd1e13e527c84).



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220231329
  
    **[Test build #58843 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58843/consoleFull)** for PR 13139 at commit [`e0079d0`](https://github.com/apache/spark/commit/e0079d03f279dc68eb19faed6d5cb6823802051a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64790228
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,137 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
    +Spark's `GeneralizedLinearRegression` interface
    +allows for flexible specification of GLMs which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +Currently in `spark.ml`, only a subset of the exponential family distributions are supported and they are listed
    +[below](#available-families).
    +
    +**NOTE**: Spark currently only supports up to 4096 features through its `GeneralizedLinearRegression`
    +interface, and will throw an exception if this constraint is exceeded. See the [advanced section](ml-advanced) for more details.
    + Still, for linear and logistic regression, models with an increased number of features can be trained 
    + using the `LinearRegression` and `LogisticRegression` estimators.
    +
    +The canonical form of an exponential family distribution is given as:
    +
    +$$
    +f_Y(y|\theta, \tau) = h(y, \tau)\exp{\left( \frac{\theta \cdot T(y) - A(\theta)}{d(\tau)} \right)}
    +$$
    +
    +where $\theta$ is the parameter of interest and $\tau$ is a dispersion parameter. In a GLM the response variable $Y_i$ is assumed to be drawn from an exponential family distribution:
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \tau \right)
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable $\mu_i$ by
    +
    +$$
    +\mu_i = A'(\theta_i)
    +$$
    +
    +Here, $A'(\theta_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $A' = g^{-1}$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = A'^{-1}(\mu_i) = g(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} h(y_i, \tau) \exp{\left(\frac{y_i\theta_i - A(\theta_i)}{d(\tau)}\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = A'(g^{-1}(\vec{x_i} \cdot \vec{\beta}))
    --- End diff --
    
    Should A' be inverted?
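
A quick numeric check suggests the answer is yes. The sketch below assumes the Poisson family, where $A(\theta) = e^\theta$, together with the canonical log link; the helper names are hypothetical and chosen only for this illustration. With the inverse, $\theta_i = A'^{-1}(g^{-1}(\eta_i))$ returns $\eta_i$, as the canonical-link equation earlier in the section requires; without it, the result is far from $\eta_i$.

```python
import math


# Poisson family (assumed for illustration): A(theta) = exp(theta).
def a_prime(theta):
    return math.exp(theta)


def a_prime_inverse(mu):
    return math.log(mu)


# Canonical log link g(mu) = log(mu), so its inverse is exp.
def link_inverse(eta):
    return math.exp(eta)


eta = 0.7
# With A' inverted: theta = A'^{-1}(g^{-1}(eta)), which should equal eta.
theta_inverted = a_prime_inverse(link_inverse(eta))
# As written in the diff: theta = A'(g^{-1}(eta)), which does not equal eta.
theta_as_written = a_prime(link_inverse(eta))
print(theta_inverted, theta_as_written)
```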



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221981860
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221300070
  
    **[Test build #59203 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59203/consoleFull)** for PR 13139 at commit [`bd30608`](https://github.com/apache/spark/commit/bd306083c312e3e2e86dd5ad09140cebd152f18c).



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63402784
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i} \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    +others.
    +
    +###  Available families
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th></th>
    +      <th>PDF</th>
    --- End diff --
    
    I am not sure if it's necessary to include these formulas which are easily found via search. I lean towards not including them, but added them initially since they are easy to take out. Feedback appreciated.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-219512074
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-222238875
  
    Merging with master and branch-2.0
    Thanks!



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64803173
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,137 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
    +Spark's `GeneralizedLinearRegression` interface
    +allows for flexible specification of GLMs which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +Currently in `spark.ml`, only a subset of the exponential family distributions are supported and they are listed
    +[below](#available-families).
    +
    +**NOTE**: Spark currently only supports up to 4096 features through its `GeneralizedLinearRegression`
    +interface, and will throw an exception if this constraint is exceeded. See the [advanced section](ml-advanced) for more details.
    + Still, for linear and logistic regression, models with an increased number of features can be trained 
    + using the `LinearRegression` and `LogisticRegression` estimators.
    +
    +The canonical form of an exponential family distribution is given as:
    +
    +$$
    +f_Y(y|\theta, \tau) = h(y, \tau)\exp{\left( \frac{\theta \cdot T(y) - A(\theta)}{d(\tau)} \right)}
    --- End diff --
    
    So, it seems as though every source on the internet, academic and otherwise, explains GLMs and exponential families with different notation and different terminology. My understanding is that GLMs usually work with an exponential family in its "natural" form, which is a transformed version of an even more generic specification of exponential families. Almost _every_ resource I find **besides** Wikipedia assumes this "natural" form without even mentioning it, so the `T(y)` appears sometimes, but mostly not. I think the updated explanation is correct, but please let me know if you think it could be clearer.
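
As a worked example of the "natural" form, take the Poisson distribution (chosen here only for illustration). Its mass function can be rewritten as

$$
f(y|\lambda) = \frac{\lambda^y e^{-\lambda}}{y!} = \frac{1}{y!} \exp{\left( y \log\lambda - \lambda \right)}
$$

which matches the canonical form above with $\theta = \log\lambda$, $T(y) = y$, $A(\theta) = e^{\theta}$, $h(y, \tau) = 1/y!$ and $d(\tau) = 1$. The mean then follows from the log-partition function:

$$
\mu = A'(\theta) = e^{\theta} = \lambda
$$

Here $T(y) = y$, which is the case most GLM references assume silently.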



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221965040
  
    **[Test build #59402 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59402/consoleFull)** for PR 13139 at commit [`a00a470`](https://github.com/apache/spark/commit/a00a47044b109f17cf8a818dd8a2d4c138fc4c5a).



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221281422
  
    @jkbradley @sethah I vote to put the IRLS section in ml-advanced. MLlib's IRLS can be used not only for GLMs but also, in the future, for other optimization problems such as robust regression. So it is more appropriate to put it in a common section shared by all algorithms rather than in a specific one.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64089774
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,148 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i} \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    +others.
    +
    +###  Available families
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th>Family</th>
    +      <th>Response Type</th>
    +      <th>Supported Links</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Gaussian</td>
    +      <td>Continuous</td>
    +      <td>Identity*, Log, Inverse</td>
    +    </tr>
    +    <tr>
    +      <td>Binomial</td>
    +      <td>Binary</td>
    +      <td>Logit*, Probit, CLogLog</td>
    +    </tr>
    +    <tr>
    +      <td>Poisson</td>
    +      <td>Count</td>
    +      <td>Log*, Identity, Sqrt</td>
    +    </tr>
    +    <tr>
    +      <td>Gamma</td>
    +      <td>Continuous</td>
    +      <td>Inverse*, Identity, Log</td>
    +    </tr>
    +    <tfoot><tr><td colspan="4">* Canonical Link</td></tr></tfoot>
    +  </tbody>
    +</table>
    +
    +### Optimization
    +
    +The `spark.ml` GLM implements the method of 
    +[iteratively reweighted least squares](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares) (IRLS) for finding
    +the optimal regression coefficients. GLMs seek to find a maximum likelihood estimate of the
    +regression coefficients by finding zeros of the [score equation](https://en.wikipedia.org/wiki/Score_(statistics)).
    +The method of IRLS uses a first-order Taylor approximation of the score equation in the vicinity of an initial guess for the expected response
    + $\vec{\mu}$. This approximation can be manipulated to the form of a simple weighted least squares regression, which is straightforward
    + to solve using a normal equation solver. Solving this initial weighted least squares problem yields a (likely poor) approximation
    + to the regression coefficients $\vec{\beta}$. However, this approximation of $\vec{\beta}$ generates an improved approximation for $\vec{\mu}$
    + using the fact that $\vec{\mu} = g^{-1}(X\vec{\beta})$. In turn, an improved approximation to $\vec{\beta}$ can be found by
    + solving the weighted least squares problem again. The iteration converges on the maximum likelihood estimate of $\vec{\beta}$ by repeatedly solving weighted least
    + squares problems in this manner (hence the name, iteratively reweighted least squares).
    +
    + Note that solving the normal equations, as in weighted least squares, for a linear system $A\vec{x} = \vec{b}$ involves 
    + inverting the covariance matrix $A^TA$. If $A$ is an $M \times N$ matrix, then $A^TA$ has dimension $N \times N$. When $N$ is relatively
    + small (< 4096), the covariance matrix can (generally) fit into main memory on the driver node and the linear system can
    + then be solved using well-established linear algebra routines such as the Cholesky decomposition. For this reason, it is important
    + to note that the `spark.ml` generalized linear regression module currently does not accept more than 4096 feature columns.
    +
    +**Example**
    +
    +The following example demonstrates training a GLM with a Gaussian response and identity link
    +function and extracting model summary statistics.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/GeneralizedLinearRegressionExample.scala %}
    --- End diff --
    
    Odd. It works for me. What is your error?
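
For readers following the Optimization text quoted in the diff above, the IRLS loop it describes can be sketched roughly as below. This is a minimal pure-Python illustration for a Poisson response with the log link, not Spark's implementation: the names `cholesky_solve` and `irls_poisson`, the starting guess `y + 0.1`, and the stopping rule are all assumptions made for the example.

```python
import math


def cholesky_solve(a, b):
    """Solve a x = b for a small symmetric positive-definite matrix a."""
    n = len(a)
    # Lower-triangular factor l such that a = l * l^T.
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    # Forward substitution: l z = b.
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(l[i][k] * z[k] for k in range(i))) / l[i][i]
    # Back substitution: l^T x = z.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


def irls_poisson(rows, y, iterations=25, tol=1e-10):
    """Fit a Poisson GLM with log link; rows hold the features (add 1.0 for an intercept)."""
    n_features = len(rows[0])
    # Initial guess for the expected response mu; the offset avoids log(0).
    mu = [yi + 0.1 for yi in y]
    beta = [0.0] * n_features
    for _ in range(iterations):
        eta = [math.log(m) for m in mu]
        # Log link: g'(mu) = 1/mu and Var(mu) = mu, so the working weight is mu
        # and the working response is eta + (y - mu) / mu.
        w = mu
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        # Normal equations of the weighted least squares step:
        # (X^T W X) beta = X^T W z.
        xtwx = [[sum(wi * r[j] * r[k] for wi, r in zip(w, rows))
                 for k in range(n_features)] for j in range(n_features)]
        xtwz = [sum(wi * r[j] * zi for wi, r, zi in zip(w, rows, z))
                for j in range(n_features)]
        new_beta = cholesky_solve(xtwx, xtwz)
        delta = max(abs(nb - ob) for nb, ob in zip(new_beta, beta))
        beta = new_beta
        # Improved estimate of mu from the new coefficients: mu = g^{-1}(X beta).
        mu = [math.exp(sum(bj * xj for bj, xj in zip(beta, r))) for r in rows]
        if delta < tol:
            break
    return beta
```

The matrix handed to `cholesky_solve` has one row and column per feature, which is why the size of the feature set, rather than the number of rows, bounds what fits on a single node.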


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63402981
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i} \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    +others.
    +
    +###  Available families
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th></th>
    +      <th>PDF</th>
    +      <th>Response Type</th>
    +      <th>Supported Links</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Gaussian</td>
    +      <td>$\frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{(x - \mu)^2}{2\sigma^2}\right)$</td>
    +      <td>Continuous</td>
    +      <td>Identity*, Log, Inverse</td>
    +    </tr>
    +    <tr>
    +      <td>Binomial</td>
    +      <td>$\binom{n}{k}p^k (1-p)^{n-k}$</td>
    +      <td>Binary</td>
    +      <td>Logit*, Probit, CLogLog</td>
    +    </tr>
    +    <tr>
    +      <td>Poisson</td>
    +      <td>$\frac{\lambda^k e^{-\lambda}}{k!}$</td>
    +      <td>Count</td>
    +      <td>Log*, Identity, Sqrt</td>
    +    </tr>
    +    <tr>
    +      <td>Gamma</td>
    +      <td>$\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$</td>
    +      <td>Continuous</td>
    +      <td>Inverse*, Identity, Log</td>
    +    </tr>
    +    <tfoot><tr><td colspan="4">* Canonical Link</td></tr></tfoot>
    +  </tbody>
    +</table>
    +
    +### Optimization
    +
    +The `spark.ml` GLM implements the method of 
    +[iteratively reweighted least squares](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares) (IRLS) for finding
    +the optimal regression coefficients. GLMs seek to find a maximum likelihood estimate of the
    +regression coefficients by finding zeros of the [score equation](https://en.wikipedia.org/wiki/Score_(statistics)). 
    +The IRLS solver casts a first-order Taylor approximation of the score equation to a weighted least squares regression and solves it
    +iteratively until convergence.
    +
    +### Input Columns
    +
    +<table class="table">
    --- End diff --
    
    These tables are in some sections and not others. Not sure what the decision criteria for including or not including them is.
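
As an aside on the "Available families" table quoted in the diff above, the starred canonical links and the relation $\mu = g^{-1}(\vec{x}^T \cdot \vec{\beta})$ can be illustrated with a short sketch. This is plain Python written for illustration only; the dictionary `CANONICAL_LINKS` and the helper `predict_mean` are hypothetical names and are not part of the `spark.ml` API.

```python
import math

# Canonical link g(mu) and its inverse for each family, following the
# starred entries of the table (identity, logit, log, inverse).
CANONICAL_LINKS = {
    "gaussian": (lambda mu: mu, lambda eta: eta),
    "binomial": (lambda mu: math.log(mu / (1.0 - mu)),
                 lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "poisson": (math.log, math.exp),
    "gamma": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
}


def predict_mean(family, features, coefficients):
    """Return mu = g^{-1}(x . beta) under the family's canonical link."""
    eta = sum(x * b for x, b in zip(features, coefficients))
    _, inverse_link = CANONICAL_LINKS[family]
    return inverse_link(eta)
```

For instance, `predict_mean("binomial", [1.0, 2.0], [0.0, 0.0])` evaluates the inverse logit at a linear predictor of zero.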



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220231389
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-222229585
  
    **[Test build #3025 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3025/consoleFull)** for PR 13139 at commit [`6e7ddd3`](https://github.com/apache/spark/commit/6e7ddd3f9e91b11979ab704438851ea4d8d99f67).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64327021
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,154 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
    +Spark's `GeneralizedLinearRegression` interface
    +allows for flexible specification of GLMs which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +Currently in `spark.ml`, only a subset of the exponential family distributions are supported and they are listed
    +[below](#available-families).
    +
    +**NOTE**: Spark currently only supports up to 4096 features for GLM models, and will throw an exception if this 
    +constraint is exceeded. See the [optimization section](#optimization) for more details.
    +
    +In a GLM the resonse variable $Y_i$ is assumed to be drawn from an exponential family distribution:
    --- End diff --
    
    typo: "response"



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64408398
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,154 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
    +Spark's `GeneralizedLinearRegression` interface
    +allows for flexible specification of GLMs which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +Currently in `spark.ml`, only a subset of the exponential family distributions are supported and they are listed
    +[below](#available-families).
    +
    +**NOTE**: Spark currently only supports up to 4096 features for GLM models, and will throw an exception if this 
    +constraint is exceeded. See the [optimization section](#optimization) for more details.
    +
    +In a GLM the resonse variable $Y_i$ is assumed to be drawn from an exponential family distribution:
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    --- End diff --
    
    Done.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220005824
  
    @sethah I think we should also document the similarities and differences between Spark GLM and R glm/glmnet.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220696033
  
    **[Test build #59016 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59016/consoleFull)** for PR 13139 at commit [`264a490`](https://github.com/apache/spark/commit/264a490c4649e621539af4d88d07b8f3fff1dc2a).



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220230302
  
    **[Test build #58843 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58843/consoleFull)** for PR 13139 at commit [`e0079d0`](https://github.com/apache/spark/commit/e0079d03f279dc68eb19faed6d5cb6823802051a).



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63689022
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i} \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    +others.
    +
    +###  Available families
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th></th>
    +      <th>PDF</th>
    --- End diff --
    
    I vote to remove the column ```PDF```. If users want to understand the meaning of it, they should also refer other documents.
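
For readers who want the connection the PDF column was hinting at, here is a short worked example (added for illustration, not part of the patch) showing how the Poisson mass function fits the exponential family form quoted in the diff. It assumes unit weight $w = 1$ and dispersion $\phi = 1$, and writes the count as $y$ rather than $k$:

$$
\frac{\lambda^y e^{-\lambda}}{y!} = \exp\left(y \log\lambda - \lambda - \log y!\right)
$$

Matching terms with $\exp\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)$ gives

$$
\theta = \log\lambda, \quad b(\theta) = e^{\theta}, \quad c(y, \phi) = \log y!
$$

Since $\mu = b'(\theta) = e^{\theta} = \lambda$, it follows that $h(\mu) = \log\mu$, which is the log link marked as canonical for the Poisson family in the table.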



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220698189
  
    **[Test build #59016 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59016/consoleFull)** for PR 13139 at commit [`264a490`](https://github.com/apache/spark/commit/264a490c4649e621539af4d88d07b8f3fff1dc2a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220228661
  
    @yanboliang @MLnick Thanks for the feedback. For now, I've just addressed the comment about the optimization section. I'll address the other comments in my next commit (very soon!).



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221290947
  
    @yanboliang SGTM. I'll update this one.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221981828
  
    **[Test build #59410 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59410/consoleFull)** for PR 13139 at commit [`6e7ddd3`](https://github.com/apache/spark/commit/6e7ddd3f9e91b11979ab704438851ea4d8d99f67).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221967254
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59402/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63943262
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i} \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    +others.
    +
    +###  Available families
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th></th>
    +      <th>PDF</th>
    +      <th>Response Type</th>
    +      <th>Supported Links</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Gaussian</td>
    +      <td>$\frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{(x - \mu)^2}{2\sigma^2}\right)$</td>
    +      <td>Continuous</td>
    +      <td>Identity*, Log, Inverse</td>
    +    </tr>
    +    <tr>
    +      <td>Binomial</td>
    +      <td>$\binom{n}{k}p^k (1-p)^{n-k}$</td>
    +      <td>Binary</td>
    +      <td>Logit*, Probit, CLogLog</td>
    +    </tr>
    +    <tr>
    +      <td>Poisson</td>
    +      <td>$\frac{\lambda^k e^{-\lambda}}{k!}$</td>
    +      <td>Count</td>
    +      <td>Log*, Identity, Sqrt</td>
    +    </tr>
    +    <tr>
    +      <td>Gamma</td>
    +      <td>$\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$</td>
    +      <td>Continuous</td>
    +      <td>Inverse*, Identity, Log</td>
    +    </tr>
    +  </tbody>
    +  <tfoot><tr><td colspan="4">* Canonical Link</td></tr></tfoot>
    +</table>
    +
    +### Optimization
    +
    +The `spark.ml` GLM implements the method of 
    +[iteratively reweighted least squares](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares) (IRLS) for finding
    +the optimal regression coefficients. GLMs seek to find a maximum likelihood estimate of the
    +regression coefficients by finding zeros of the [score equation](https://en.wikipedia.org/wiki/Score_(statistics)). 
    +The IRLS solver casts a first-order Taylor approximation of the score equation to a weighted least squares regression and solves it
    +iteratively until convergence.
    +
    +### Input Columns
    +
    +<table class="table">
    --- End diff --
    
    I removed the sections for now. I like the idea of not duplicating information, but I'm open to others' thoughts.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63943317
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so-called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i}^T \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    +others.
    +
    +###  Available families
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th></th>
    +      <th>PDF</th>
    +      <th>Response Type</th>
    +      <th>Supported Links</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Gaussian</td>
    +      <td>$\frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{(x - \mu)^2}{2\sigma^2}\right)$</td>
    +      <td>Continuous</td>
    +      <td>Identity*, Log, Inverse</td>
    +    </tr>
    +    <tr>
    +      <td>Binomial</td>
    +      <td>$\binom{n}{k}p^k (1-p)^{n-k}$</td>
    +      <td>Binary</td>
    +      <td>Logit*, Probit, CLogLog</td>
    +    </tr>
    +    <tr>
    +      <td>Poisson</td>
    +      <td>$\frac{\lambda^k e^{-\lambda}}{k!}$</td>
    +      <td>Count</td>
    +      <td>Log*, Identity, Sqrt</td>
    +    </tr>
    +    <tr>
    +      <td>Gamma</td>
    +      <td>$\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$</td>
    +      <td>Continuous</td>
    +      <td>Inverse*, Identity, Log</td>
    +    </tr>
    +  </tbody>
    +  <tfoot><tr><td colspan="4">* Canonical Link</td></tr></tfoot>
    +</table>
    +
    +### Optimization
    +
    +The `spark.ml` GLM implements the method of 
    +[iteratively reweighted least squares](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares) (IRLS) for finding
    +the optimal regression coefficients. GLMs seek to find a maximum likelihood estimate of the
    +regression coefficients by finding zeros of the [score equation](https://en.wikipedia.org/wiki/Score_(statistics)). 
    +The IRLS solver casts a first-order Taylor approximation of the score equation to a weighted least squares regression and solves it
    +iteratively until convergence.
    +
    +### Input Columns
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th align="left">Param name</th>
    +      <th align="left">Type(s)</th>
    +      <th align="left">Default</th>
    +      <th align="left">Description</th>
    +    </tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>labelCol</td>
    +      <td>Double</td>
    +      <td>"label"</td>
    +      <td>Label to predict</td>
    +    </tr>
    +    <tr>
    +      <td>featuresCol</td>
    +      <td>Vector</td>
    +      <td>"features"</td>
    +      <td>Feature vector</td>
    +    </tr>
    +    <tr>
    +      <td>weightCol</td>
    +      <td>Double</td>
    +      <td>""</td>
    +      <td>Sample weights</td>
    +    </tr>
    +  </tbody>
    +</table>
    +
    +### Output Columns
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th align="left">Param name</th>
    +      <th align="left">Type(s)</th>
    +      <th align="left">Default</th>
    +      <th align="left">Description</th>
    +    </tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>predictionCol</td>
    +      <td>Double</td>
    +      <td>"prediction"</td>
    +      <td>Predicted label</td>
    +    </tr>
    +    <tr>
    +      <td>linkPredictionCol</td>
    +      <td>Double</td>
    +      <td>""</td>
    +      <td>Linear predicted response</td>
    +    </tr>
    +  </tbody>
    +</table>
    +
    +
    +**Example**
    +
    +The following example demonstrates training a GLM with a Gaussian response and identity link
    +function and extracting model summary statistics.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/GeneralizedLinearRegressionExample.scala %}
    --- End diff --
    
    Done.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63823104
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so-called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i}^T \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
    +others.
    +
    +###  Available families
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th></th>
    +      <th>PDF</th>
    +      <th>Response Type</th>
    +      <th>Supported Links</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Gaussian</td>
    +      <td>$\frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{(x - \mu)^2}{2\sigma^2}\right)$</td>
    +      <td>Continuous</td>
    +      <td>Identity*, Log, Inverse</td>
    +    </tr>
    +    <tr>
    +      <td>Binomial</td>
    +      <td>$\binom{n}{k}p^k (1-p)^{n-k}$</td>
    +      <td>Binary</td>
    +      <td>Logit*, Probit, CLogLog</td>
    +    </tr>
    +    <tr>
    +      <td>Poisson</td>
    +      <td>$\frac{\lambda^k e^{-\lambda}}{k!}$</td>
    +      <td>Count</td>
    +      <td>Log*, Identity, Sqrt</td>
    +    </tr>
    +    <tr>
    +      <td>Gamma</td>
    +      <td>$\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$</td>
    +      <td>Continuous</td>
    +      <td>Inverse*, Identity, Log</td>
    +    </tr>
    +  </tbody>
    +  <tfoot><tr><td colspan="4">* Canonical Link</td></tr></tfoot>
    +</table>
    +
    +### Optimization
    --- End diff --
    
    So, I went ahead and added some more detail on the optimization routine. I made an effort to stress the limitations on numFeatures and to give some explanation as to why. Could you take a look at it? I didn't generate the docs to make sure it looks alright just yet, but I wanted to get that up so it could be reviewed.
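
To make the optimization discussion concrete, here is a minimal pure-Python sketch of IRLS for a Poisson GLM with the canonical log link. It is an illustration under stated assumptions, not Spark's implementation: the function names `solve` and `irls_poisson` are hypothetical, and the working weight and working response are the textbook ones for the log link. Each iteration solves a p-by-p system of normal equations, which, as far as I understand, is why a solver of this kind is only practical for a modest number of features.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] so row operations touch both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def irls_poisson(xs, ys, max_iter=50, tol=1e-10):
    """Fit Poisson regression with the log link by IRLS (hypothetical helper)."""
    p = len(xs[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        # Accumulate the normal equations X^T W X beta = X^T W z.
        xtwx = [[0.0] * p for _ in range(p)]
        xtwz = [0.0] * p
        for x, y in zip(xs, ys):
            eta = sum(xj * bj for xj, bj in zip(x, beta))  # linear predictor
            mu = math.exp(eta)                             # inverse log link
            w = mu                                         # working weight
            z = eta + (y - mu) / mu                        # working response
            for j in range(p):
                xtwz[j] += x[j] * w * z
                for k in range(p):
                    xtwx[j][k] += x[j] * w * x[k]
        new_beta = solve(xtwx, xtwz)
        delta = max(abs(a - b) for a, b in zip(new_beta, beta))
        beta = new_beta
        if delta < tol:
            break
    return beta
```

Each row of `xs` is a feature vector whose first entry is 1.0 for the intercept. The p-by-p matrix `xtwx` is the only object whose size grows with the square of the feature count, which is the cost a limit on numFeatures guards against.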



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220231390
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58843/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221355961
  
    @sethah Thanks for the updates.  You guessed correctly--I meant to add it to the user guide.  Which section of that guide are you following?  E.g., looking at the GLM theory section, they use slightly different notation.  I wonder if we should match Wikipedia instead since that will probably be the most commonly used reference.  What do you think?



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221302208
  
    **[Test build #59203 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59203/consoleFull)** for PR 13139 at commit [`bd30608`](https://github.com/apache/spark/commit/bd306083c312e3e2e86dd5ad09140cebd152f18c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64408275
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,154 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
    +Spark's `GeneralizedLinearRegression` interface
    +allows for flexible specification of GLMs which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +Currently in `spark.ml`, only a subset of the exponential family distributions are supported and they are listed
    +[below](#available-families).
    +
    +**NOTE**: Spark currently only supports up to 4096 features for GLM models, and will throw an exception if this 
    +constraint is exceeded. See the [optimization section](#optimization) for more details.
    --- End diff --
    
    Added.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63890860
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family). 
    --- End diff --
    
    The families and link functions are documented in the table below. I could move the table, or are you suggesting something somewhat different?
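
As a small illustration of how the family and link table is used at prediction time, here is a pure-Python sketch. The names `LINKS`, `CANONICAL_LINK`, and `predict_mean` are hypothetical and not part of any Spark API; only the canonical family and link pairings are taken from the table in the diff, and only four of the listed links are shown.

```python
import math

# Each entry maps a link name to (link g, inverse link g^{-1}).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "logit": (
        lambda mu: math.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    ),
}

# Canonical link per family, as marked with * in the table.
CANONICAL_LINK = {
    "gaussian": "identity",
    "binomial": "logit",
    "poisson": "log",
    "gamma": "inverse",
}


def predict_mean(family, x, beta, link=None):
    """Return mu = g^{-1}(x . beta), using the canonical link unless one is given."""
    name = link or CANONICAL_LINK[family]
    _, inverse_link = LINKS[name]
    eta = sum(xj * bj for xj, bj in zip(x, beta))  # linear predictor
    return inverse_link(eta)
```

For example, `predict_mean("poisson", [1.0, 2.0], [0.0, 0.5])` applies the inverse log link to a linear predictor of 1.0, while passing `link="identity"` would return the linear predictor unchanged.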



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64086417
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    --- End diff --
    
    See comment below. Also, it would be more clear to say that the current implementation can only deal with up to 4096 features, and that trying to use more features will lead to an error.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221967104
  
    **[Test build #59402 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59402/consoleFull)** for PR 13139 at commit [`a00a470`](https://github.com/apache/spark/commit/a00a47044b109f17cf8a818dd8a2d4c138fc4c5a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-219509831
  
    **[Test build #58653 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58653/consoleFull)** for PR 13139 at commit [`6de88b8`](https://github.com/apache/spark/commit/6de88b80d43720fe9e66c48205937b8fafded8d3).



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221302345
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221302347
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59203/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-219509348
  
    cc @yanboliang @mengxr If you get a chance could you review this? Trying to get into Spark 2.0.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-220434180
  
    I addressed the review comments. Please let me know what else there is. Also, @yanboliang I'd be happy to add specific differences between R and Spark if you have some things in mind? Thanks!



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64802545
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,137 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
    +Spark's `GeneralizedLinearRegression` interface
    +allows for flexible specification of GLMs which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +Currently in `spark.ml`, only a subset of the exponential family distributions are supported and they are listed
    +[below](#available-families).
    +
    +**NOTE**: Spark currently only supports up to 4096 features through its `GeneralizedLinearRegression`
    +interface, and will throw an exception if this constraint is exceeded. See the [advanced section](ml-advanced) for more details.
    + Still, for linear and logistic regression, models with an increased number of features can be trained 
    + using the `LinearRegression` and `LogisticRegression` estimators.
    +
    +The canonical form of an exponential family distribution is given as:
    +
    +$$
    +f_Y(y|\theta, \tau) = h(y, \tau)\exp{\left( \frac{\theta \cdot T(y) - A(\theta)}{d(\tau)} \right)}
    +$$
    +
    +where $\theta$ is the parameter of interest and $\tau$ is a dispersion parameter. In a GLM the response variable $Y_i$ is assumed to be drawn from an exponential family distribution:
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \tau \right)
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected value of the response variable $\mu_i$ by
    +
    +$$
    +\mu_i = A'(\theta_i)
    +$$
    +
    +Here, $A'(\theta_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
    +and the so-called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $A' = g^{-1}$, which yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = A'^{-1}(\mu_i) = g(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    --- End diff --
    
    Done.
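
For readers following the derivation in the diff above, here is a minimal numerical sketch of the canonical-link identity $\theta_i = \eta_i$, using the Poisson family as the example. It assumes the standard exponential-family form of the Poisson distribution, with $T(y) = y$, $A(\theta) = e^{\theta}$, $h(y) = 1/y!$ and $d(\tau) = 1$. All function names and the sample values of `x` and `beta` are hypothetical illustrations and are not part of the patch or of Spark.

```python
import math


def poisson_pmf_canonical(y, theta):
    """Poisson pmf written as h(y) * exp(theta * y - A(theta)), with A(theta) = exp(theta)."""
    h = 1.0 / math.factorial(y)
    a = math.exp(theta)
    return h * math.exp(theta * y - a)


def poisson_pmf_standard(y, mu):
    """The familiar Poisson pmf, parameterized by the mean mu."""
    return mu ** y * math.exp(-mu) / math.factorial(y)


def mean_from_theta(theta):
    """mu = A'(theta); for the Poisson family A'(theta) = exp(theta)."""
    return math.exp(theta)


def log_link(mu):
    """g(mu) = log(mu), the canonical link for the Poisson family."""
    return math.log(mu)


# Hypothetical feature vector and coefficients (illustrative values only).
x = [1.0, 0.5, -2.0]
beta = [0.2, -0.4, 0.1]

# Linear predictor eta = x^T beta.
eta = sum(xi * bi for xi, bi in zip(x, beta))

# With the canonical link, g^{-1} = A', so the mean is A'(eta).
mu = mean_from_theta(eta)

# Recover theta from the mean; it should coincide with eta.
theta = log_link(mu)
```

Under these assumptions `theta` and `eta` agree up to floating-point error, and the two pmf functions return the same probability for any count `y`, which is the simplification the canonical link provides.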


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221945322
  
    Thanks!  Just a few comments




[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63688136
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    --- End diff --
    
    I think it would be better to document which of these Spark's GeneralizedLinearRegression supports.
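
As a rough sketch of what such documentation could enumerate, the lookup below lists the family and link combinations as I understand them from the Spark 2.0 `GeneralizedLinearRegression` API documentation, with the canonical link listed first for each family. The dictionary and helper names are hypothetical and are not Spark API; treat the contents as an assumption to be checked against the released docs.

```python
# Hypothetical lookup table (not part of Spark). For each family, the first
# link listed is the canonical one.
SUPPORTED_LINKS = {
    "gaussian": ["identity", "log", "inverse"],
    "binomial": ["logit", "probit", "cloglog"],
    "poisson": ["log", "identity", "sqrt"],
    "gamma": ["inverse", "identity", "log"],
}


def canonical_link(family):
    """Return the canonical link name for a family (case-insensitive)."""
    return SUPPORTED_LINKS[family.lower()][0]


def is_supported(family, link):
    """Check whether a family/link pair appears in the table above."""
    return link.lower() in SUPPORTED_LINKS.get(family.lower(), [])
```

A caller could use `is_supported("binomial", "probit")` to validate a configuration before building an estimator; unknown families simply return `False`.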



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/13139#issuecomment-221299931
  
    @jkbradley Thanks for the review! I wasn't sure if you wanted me to link to the reference I'm using in the actual user guide, or just here for reference. I added it to the user guide for now. Let me know if there is anything else.



[GitHub] spark pull request: [SPARK-15186][ML][DOCS] Add user guide for gen...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13139

