Posted to reviews@spark.apache.org by thunterdb <gi...@git.apache.org> on 2015/12/08 20:36:38 UTC

[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

GitHub user thunterdb opened a pull request:

    https://github.com/apache/spark/pull/10207

    [SPARK-8517][MLLIB][DOC] Reorganizes the spark.ml user guide

    This PR moves pieces of the spark.ml user guide to reflect suggestions in SPARK-8517. It does not introduce new content, as requested.
    
    <img width="192" alt="screen shot 2015-12-08 at 11 36 00 am" src="https://cloud.githubusercontent.com/assets/7594753/11666166/e82b84f2-9d9f-11e5-8904-e215424d8444.png">


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thunterdb/spark spark-8517

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10207.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10207
    
----
commit af8abb9713f90478a1a0b81fdc83192192354c30
Author: Timothy Hunter <ti...@databricks.com>
Date:   2015-12-08T18:23:06Z

    moved content

commit 451b7737f553fbc425ce2144fe0930b885874c7f
Author: Timothy Hunter <ti...@databricks.com>
Date:   2015-12-08T19:32:22Z

    reordered doc

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163037821
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163002087
  
    Remove empty "ml-examples.md" file?




[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-162996582
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47039755
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -0,0 +1,762 @@
    +---
    +layout: global
    +title: Classification and regression - spark.ml
    +displayTitle: Classification and regression in spark.ml
    +---
    +
    +
    +`\[
    +\newcommand{\R}{\mathbb{R}}
    +\newcommand{\E}{\mathbb{E}}
    +\newcommand{\x}{\mathbf{x}}
    +\newcommand{\y}{\mathbf{y}}
    +\newcommand{\wv}{\mathbf{w}}
    +\newcommand{\av}{\mathbf{\alpha}}
    +\newcommand{\bv}{\mathbf{b}}
    +\newcommand{\N}{\mathbb{N}}
    +\newcommand{\id}{\mathbf{I}}
    +\newcommand{\ind}{\mathbf{1}}
    +\newcommand{\0}{\mathbf{0}}
    +\newcommand{\unit}{\mathbf{e}}
    +\newcommand{\one}{\mathbf{1}}
    +\newcommand{\zero}{\mathbf{0}}
    +\]`
    +
    +**Table of Contents**
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +In MLlib, we implement popular linear methods such as logistic
    +regression and linear least squares with $L_1$ or $L_2$ regularization.
    +Refer to [the linear methods in `spark.mllib`](mllib-linear-methods.html) for
    +details.  In `spark.ml`, we also include a Pipelines API for [Elastic
    +net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
    +of $L_1$ and $L_2$ regularization proposed in [Zou et al., Regularization
    +and variable selection via the elastic
    +net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
    +Mathematically, it is defined as a convex combination of the $L_1$ and
    +the $L_2$ regularization terms:
    +`\[
    +\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
    +\]`
    +By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
    +regularization as special cases. For example, if a [linear
    +regression](https://en.wikipedia.org/wiki/Linear_regression) model is
    +trained with the elastic net parameter $\alpha$ set to $1$, it is
    +equivalent to a
    +[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
    +On the other hand, if $\alpha$ is set to $0$, the trained model reduces
    +to a [ridge
    +regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
    +We implement a Pipelines API for both linear regression and logistic
    +regression with elastic net regularization.
    +
    +
    +# Classification
    +
    +## Logistic regression
    +
    +Logistic regression is a popular method to predict a binary response. It is a special case of [Generalized Linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) that predicts the probability of the outcome.
    +For more background and more details about the implementation, refer to the documentation of the [logistic regression in `spark.mllib`](mllib-linear-methods.html#logistic-regression). 
    +
    +  > The current implementation of logistic regression in `spark.ml` only supports binary classification. Support for multiclass classification will be added in the future.
    +
    +The following example shows how to train a logistic regression model
    +with elastic net regularization. `elasticNetParam` corresponds to
    +$\alpha$ and `regParam` corresponds to $\lambda$.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/logistic_regression_with_elastic_net.py %}
    +</div>
    +
    +</div>
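    +
    +As a quick orientation, here is a minimal sketch of what such an example typically contains (assuming a `DataFrame` named `training` with `label` and `features` columns; the bundled example files above are authoritative):
    +
    +```scala
    +import org.apache.spark.ml.classification.LogisticRegression
    +
    +// `training` is a hypothetical DataFrame with "label" and "features" columns.
    +val lr = new LogisticRegression()
    +  .setMaxIter(10)
    +  .setRegParam(0.3)        // lambda: overall regularization strength
    +  .setElasticNetParam(0.8) // alpha: 0 = pure L2 (ridge), 1 = pure L1 (lasso)
    +val lrModel = lr.fit(training)
    +println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
    +```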
    +
    +The `spark.ml` implementation of logistic regression also supports
    +extracting a summary of the model over the training set. Note that the
    +predictions and metrics, which are stored as a `DataFrame` in
    +`BinaryLogisticRegressionSummary`, are annotated `@transient` and hence
    +only available on the driver.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +
    +[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
    +provides a summary for a
    +[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel).
    +Currently, only binary classification is supported and the
    +summary must be explicitly cast to
    +[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary).
    +This will likely change when multiclass classification is supported.
    +
    +Continuing the earlier example:
    +
    +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionSummaryExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html)
    +provides a summary for a
    +[`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html).
    +Currently, only binary classification is supported and the
    +summary must be explicitly cast to
    +[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html).
    +This will likely change when multiclass classification is supported.
    +
    +Continuing the earlier example:
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionSummaryExample.java %}
    +</div>
    +
    +<!--- TODO: Add python model summaries once implemented -->
    +<div data-lang="python" markdown="1">
    +Logistic regression model summary is not yet supported in Python.
    +</div>
    +
    +</div>
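    +
    +A minimal sketch of the summary extraction step (continuing from a fitted `lrModel` as above; the cast reflects the binary-only restriction described earlier):
    +
    +```scala
    +import org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary
    +
    +val trainingSummary = lrModel.summary
    +// The cast is required while only binary classification is supported.
    +val binarySummary = trainingSummary.asInstanceOf[BinaryLogisticRegressionTrainingSummary]
    +println(s"areaUnderROC: ${binarySummary.areaUnderROC}")
    +binarySummary.roc.show() // stored @transient, so only available on the driver
    +```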
    +
    +
    +## Classification with decision trees
    +
    +Decision trees are a popular family of classification and regression methods.
    +More information about the `spark.ml` implementation can be found further in the [section on decision trees](#decision-trees).
    +
    +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
    +We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize.
    +
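    +A condensed sketch of the shared pattern (names like `data` are hypothetical; see the linked example files for the full versions):
    +
    +```scala
    +import org.apache.spark.ml.Pipeline
    +import org.apache.spark.ml.classification.DecisionTreeClassifier
    +import org.apache.spark.ml.feature.{StringIndexer, VectorIndexer}
    +
    +// `data` is a hypothetical DataFrame loaded from a LibSVM file.
    +val labelIndexer = new StringIndexer()
    +  .setInputCol("label").setOutputCol("indexedLabel").fit(data)
    +val featureIndexer = new VectorIndexer()
    +  .setInputCol("features").setOutputCol("indexedFeatures")
    +  .setMaxCategories(4) // features with > 4 distinct values are treated as continuous
    +  .fit(data)
    +val dt = new DecisionTreeClassifier()
    +  .setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures")
    +val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
    +val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, dt))
    +val model = pipeline.fit(trainingData)
    +val predictions = model.transform(testData)
    +```
    +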
    --- End diff --
    
    Example heading (here and everywhere else)




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47039758
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -0,0 +1,762 @@
    +---
    +layout: global
    +title: Classification and regression - spark.ml
    +displayTitle: Classification and regression in spark.ml
    +---
    +
    +
    +`\[
    +\newcommand{\R}{\mathbb{R}}
    +\newcommand{\E}{\mathbb{E}}
    +\newcommand{\x}{\mathbf{x}}
    +\newcommand{\y}{\mathbf{y}}
    +\newcommand{\wv}{\mathbf{w}}
    +\newcommand{\av}{\mathbf{\alpha}}
    +\newcommand{\bv}{\mathbf{b}}
    +\newcommand{\N}{\mathbb{N}}
    +\newcommand{\id}{\mathbf{I}}
    +\newcommand{\ind}{\mathbf{1}}
    +\newcommand{\0}{\mathbf{0}}
    +\newcommand{\unit}{\mathbf{e}}
    +\newcommand{\one}{\mathbf{1}}
    +\newcommand{\zero}{\mathbf{0}}
    +\]`
    +
    +**Table of Contents**
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +In MLlib, we implement popular linear methods such as logistic
    +regression and linear least squares with $L_1$ or $L_2$ regularization.
    +Refer to [the linear methods in `spark.mllib`](mllib-linear-methods.html) for
    +details.  In `spark.ml`, we also include a Pipelines API for [Elastic
    +net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
    +of $L_1$ and $L_2$ regularization proposed in [Zou et al., Regularization
    +and variable selection via the elastic
    +net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
    +Mathematically, it is defined as a convex combination of the $L_1$ and
    +the $L_2$ regularization terms:
    +`\[
    +\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
    +\]`
    +By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
    +regularization as special cases. For example, if a [linear
    +regression](https://en.wikipedia.org/wiki/Linear_regression) model is
    +trained with the elastic net parameter $\alpha$ set to $1$, it is
    +equivalent to a
    +[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
    +On the other hand, if $\alpha$ is set to $0$, the trained model reduces
    +to a [ridge
    +regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
    +We implement a Pipelines API for both linear regression and logistic
    +regression with elastic net regularization.
    +
    +
    +# Classification
    +
    +## Logistic regression
    +
    +Logistic regression is a popular method to predict a binary response. It is a special case of [Generalized Linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) that predicts the probability of the outcome.
    +For more background and more details about the implementation, refer to the documentation of the [logistic regression in `spark.mllib`](mllib-linear-methods.html#logistic-regression). 
    +
    +  > The current implementation of logistic regression in `spark.ml` only supports binary classification. Support for multiclass classification will be added in the future.
    +
    +The following example shows how to train a logistic regression model
    +with elastic net regularization. `elasticNetParam` corresponds to
    +$\alpha$ and `regParam` corresponds to $\lambda$.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/logistic_regression_with_elastic_net.py %}
    +</div>
    +
    +</div>
    +
    +The `spark.ml` implementation of logistic regression also supports
    +extracting a summary of the model over the training set. Note that the
    +predictions and metrics, which are stored as a `DataFrame` in
    +`BinaryLogisticRegressionSummary`, are annotated `@transient` and hence
    +only available on the driver.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +
    +[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
    +provides a summary for a
    +[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel).
    +Currently, only binary classification is supported and the
    +summary must be explicitly cast to
    +[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary).
    +This will likely change when multiclass classification is supported.
    +
    +Continuing the earlier example:
    +
    +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionSummaryExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html)
    +provides a summary for a
    +[`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html).
    +Currently, only binary classification is supported and the
    +summary must be explicitly cast to
    +[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html).
    +This will likely change when multiclass classification is supported.
    +
    +Continuing the earlier example:
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionSummaryExample.java %}
    +</div>
    +
    +<!--- TODO: Add python model summaries once implemented -->
    +<div data-lang="python" markdown="1">
    +Logistic regression model summary is not yet supported in Python.
    +</div>
    +
    +</div>
    +
    +
    +## Classification with decision trees
    +
    +Decision trees are a popular family of classification and regression methods.
    +More information about the `spark.ml` implementation can be found further in the [section on decision trees](#decision-trees).
    +
    +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
    +We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier).
    +
    +{% include_example scala/org/apache/spark/examples/ml/DecisionTreeClassificationExample.scala %}
    +
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +More details on parameters can be found in the [Java API documentation](api/java/org/apache/spark/ml/classification/DecisionTreeClassifier.html).
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java %}
    +
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +More details on parameters can be found in the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassifier).
    +
    +{% include_example python/ml/decision_tree_classification_example.py %}
    +
    +</div>
    +
    +</div>
    +
    +## Classification with random forests
    --- End diff --
    
    "Random Forest Classifier"  (same for other tree/ensemble headers below)




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by thunterdb <gi...@git.apache.org>.
Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47018195
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -0,0 +1,733 @@
    +---
    +layout: global
    +title: Classification and regression - spark.ml
    +displayTitle: Classification and regression in spark.ml
    +---
    +
    +
    +`\[
    +\newcommand{\R}{\mathbb{R}}
    +\newcommand{\E}{\mathbb{E}}
    +\newcommand{\x}{\mathbf{x}}
    +\newcommand{\y}{\mathbf{y}}
    +\newcommand{\wv}{\mathbf{w}}
    +\newcommand{\av}{\mathbf{\alpha}}
    +\newcommand{\bv}{\mathbf{b}}
    +\newcommand{\N}{\mathbb{N}}
    +\newcommand{\id}{\mathbf{I}}
    +\newcommand{\ind}{\mathbf{1}}
    +\newcommand{\0}{\mathbf{0}}
    +\newcommand{\unit}{\mathbf{e}}
    +\newcommand{\one}{\mathbf{1}}
    +\newcommand{\zero}{\mathbf{0}}
    +\]`
    +
    +**Table of Contents**
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +In MLlib, we implement popular linear methods such as logistic
    +regression and linear least squares with $L_1$ or $L_2$ regularization.
    +Refer to [the linear methods in `spark.mllib`](mllib-linear-methods.html) for
    +details.  In `spark.ml`, we also include a Pipelines API for [Elastic
    +net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
    +of $L_1$ and $L_2$ regularization proposed in [Zou et al., Regularization
    +and variable selection via the elastic
    +net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
    +Mathematically, it is defined as a convex combination of the $L_1$ and
    +the $L_2$ regularization terms:
    +`\[
    +\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
    +\]`
    +By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
    +regularization as special cases. For example, if a [linear
    +regression](https://en.wikipedia.org/wiki/Linear_regression) model is
    +trained with the elastic net parameter $\alpha$ set to $1$, it is
    +equivalent to a
    +[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
    +On the other hand, if $\alpha$ is set to $0$, the trained model reduces
    +to a [ridge
    +regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
    +We implement a Pipelines API for both linear regression and logistic
    +regression with elastic net regularization.
    +
    +# Regression
    +
    +## Linear regression
    +
    +The interface for working with linear regression models and model
    +summaries is similar to the logistic regression case. The following
    +example demonstrates training an elastic net regularized linear
    +regression model and extracting model summary statistics.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/LinearRegressionWithElasticNetExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +<!--- TODO: Add python model summaries once implemented -->
    +{% include_example python/ml/linear_regression_with_elastic_net.py %}
    +</div>
    +
    +</div>
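    +
    +A minimal sketch of the pattern (assuming a hypothetical `training` DataFrame with `label` and `features` columns):
    +
    +```scala
    +import org.apache.spark.ml.regression.LinearRegression
    +
    +val lr = new LinearRegression()
    +  .setMaxIter(10)
    +  .setRegParam(0.3)        // lambda
    +  .setElasticNetParam(0.8) // alpha
    +val lrModel = lr.fit(training)
    +val trainingSummary = lrModel.summary // no cast needed, unlike logistic regression
    +println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
    +println(s"r2: ${trainingSummary.r2}")
    +```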
    +
    +## Survival regression
    +
    +
    +In `spark.ml`, we implement the [Accelerated failure time (AFT)](https://en.wikipedia.org/wiki/Accelerated_failure_time_model)
    +model, which is a parametric survival regression model for censored data.
    +It describes a model for the log of the survival time, so it is often called a
    +log-linear model for survival analysis. Unlike a
    +[Proportional hazards](https://en.wikipedia.org/wiki/Proportional_hazards_model) model
    +designed for the same purpose, the AFT model is easier to parallelize
    +because each instance contributes to the objective function independently.
    +
    +Given the values of the covariates $x^{'}$, for random lifetimes $t_{i}$ of
    +subjects $i = 1, \ldots, n$, with possible right-censoring,
    +the likelihood function under the AFT model is given as:
    +`\[
    +L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0}(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0}(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}}
    +\]`
    +where $\delta_{i}$ is the indicator of whether the event has occurred, i.e. whether the observation is uncensored.
    +Using $\epsilon_{i}=\frac{\log{t_{i}}-x^{'}\beta}{\sigma}$, the log-likelihood function
    +assumes the form:
    +`\[
    +\iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+\delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}]
    +\]`
    +where $S_{0}(\epsilon_{i})$ is the baseline survivor function
    +and $f_{0}(\epsilon_{i})$ is the corresponding density function.
    +
    +The most commonly used AFT model is based on the Weibull distribution of the survival time.
    +A Weibull distribution for the lifetime corresponds to an extreme value distribution for the
    +log of the lifetime, and the $S_{0}(\epsilon)$ function is:
    +`\[   
    +S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}})
    +\]`
    +and the $f_{0}(\epsilon_{i})$ function is:
    +`\[
    +f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}})
    +\]`
    +The log-likelihood function for the AFT model with a Weibull distribution of the lifetime is:
    +`\[
    +\iota(\beta,\sigma)= -\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}]
    +\]`
    +Since minimizing the negative log-likelihood is equivalent to maximizing the a posteriori probability,
    +the loss function we optimize is $-\iota(\beta,\sigma)$.
    +The gradient functions for $\beta$ and $\log\sigma$ respectively are:
    +`\[   
    +\frac{\partial (-\iota)}{\partial \beta}=\sum_{i=1}^{n}[\delta_{i}-e^{\epsilon_{i}}]\frac{x_{i}}{\sigma}
    +\]`
    +`\[ 
    +\frac{\partial (-\iota)}{\partial (\log\sigma)}=\sum_{i=1}^{n}[\delta_{i}+(\delta_{i}-e^{\epsilon_{i}})\epsilon_{i}]
    +\]`
    +
    +The AFT model can be formulated as a convex optimization problem,
    +i.e. the task of finding a minimizer of the convex function $-\iota(\beta,\sigma)$
    +that depends on the coefficients vector $\beta$ and the log of the scale parameter $\log\sigma$.
    +The optimization algorithm underlying the implementation is L-BFGS.
    +The implementation matches the result of R's survival function
    +[survreg](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html).
    +
    +## Example
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/AFTSurvivalRegressionExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaAFTSurvivalRegressionExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/aft_survival_regression.py %}
    +</div>
    +
    +</div>
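    +
    +For orientation, a minimal sketch of the API (a tiny hypothetical dataset; a `SQLContext` named `sqlContext` is assumed to be in scope):
    +
    +```scala
    +import org.apache.spark.ml.regression.AFTSurvivalRegression
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +// label = observed time; censor = 1.0 if the event occurred (uncensored),
    +// 0.0 if the observation is right-censored.
    +val training = sqlContext.createDataFrame(Seq(
    +  (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    +  (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    +  (3.627, 0.0, Vectors.dense(1.380, 0.231))
    +)).toDF("label", "censor", "features")
    +
    +val aft = new AFTSurvivalRegression()
    +  .setQuantileProbabilities(Array(0.3, 0.6))
    +  .setQuantilesCol("quantiles")
    +val model = aft.fit(training)
    +println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept} Scale: ${model.scale}")
    +```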
    +
    +
    +# Classification
    +
    +## Logistic regression
    +
    +The following example shows how to train a logistic regression model
    +with elastic net regularization. `elasticNetParam` corresponds to
    +$\alpha$ and `regParam` corresponds to $\lambda$.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/logistic_regression_with_elastic_net.py %}
    +</div>
    +
    +</div>
    +
    +The `spark.ml` implementation of logistic regression also supports
    +extracting a summary of the model over the training set. Note that the
    +predictions and metrics, which are stored as a `DataFrame` in
    +`BinaryLogisticRegressionSummary`, are annotated `@transient` and hence
    +only available on the driver.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +
    +[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
    +provides a summary for a
    +[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel).
    +Currently, only binary classification is supported and the
    +summary must be explicitly cast to
    +[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary).
    +This will likely change when multiclass classification is supported.
    +
    +Continuing the earlier example:
    +
    +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionSummaryExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html)
    +provides a summary for a
    +[`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html).
    +Currently, only binary classification is supported and the
    +summary must be explicitly cast to
    +[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html).
    +This will likely change when multiclass classification is supported.
    +
    +Continuing the earlier example:
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionSummaryExample.java %}
    +</div>
    +
    +<!--- TODO: Add python model summaries once implemented -->
    +<div data-lang="python" markdown="1">
    +Logistic regression model summary is not yet supported in Python.
    +</div>
    +
    +</div>
    +
    +
    +## Multilayer perceptron classifier
    +
    +Multilayer perceptron classifier (MLPC) is a classifier based on the [feedforward artificial neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network).
    +MLPC consists of multiple layers of nodes.
    +Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs
    +by performing a linear combination of the inputs with the node's weights `$\wv$` and bias `$\bv$` and applying an activation function.
    +It can be written in matrix form for MLPC with `$K+1$` layers as follows:
    +`\[
    +\mathrm{y}(\x) = \mathrm{f_K}(...\mathrm{f_2}(\wv_2^T\mathrm{f_1}(\wv_1^T \x+b_1)+b_2)...+b_K)
    +\]`
    +Nodes in intermediate layers use the sigmoid (logistic) function:
    +`\[
    +\mathrm{f}(z_i) = \frac{1}{1 + e^{-z_i}}
    +\]`
    +Nodes in the output layer use the softmax function:
    +`\[
    +\mathrm{f}(z_i) = \frac{e^{z_i}}{\sum_{k=1}^N e^{z_k}}
    +\]`
    +The number of nodes `$N$` in the output layer corresponds to the number of classes.
    +
    +MLPC employs backpropagation for learning the model. We use the logistic loss function for optimization and L-BFGS as the optimization routine.
    +
    +**Examples**
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/MultilayerPerceptronClassifierExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaMultilayerPerceptronClassifierExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/multilayer_perceptron_classification.py %}
    +</div>
    +
    +</div>
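    +
    +A minimal sketch (hypothetical `train` DataFrame; the layer sizes are illustrative):
    +
    +```scala
    +import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
    +
    +// 4 input features, two hidden layers of 5 and 4 nodes, 3 output classes.
    +val layers = Array[Int](4, 5, 4, 3)
    +val trainer = new MultilayerPerceptronClassifier()
    +  .setLayers(layers)
    +  .setBlockSize(128)
    +  .setSeed(1234L)
    +  .setMaxIter(100)
    +val model = trainer.fit(train)
    +```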
    +
    +
    +## One-vs-Rest classifier (a.k.a. One-vs-All)
    +
    +[OneVsRest](http://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently.  It is also known as "One-vs-All."
    +
    +`OneVsRest` is implemented as an `Estimator`. For the base classifier it takes instances of `Classifier` and creates a binary classification problem for each of the k classes. The classifier for class i is trained to predict whether the label is i or not, distinguishing class i from all other classes.
    +
    +Predictions are made by evaluating each binary classifier, and the index of the most confident classifier is output as the label.
    +
    +### Example
    +
    +The example below demonstrates how to load the
    +[Iris dataset](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale), parse it as a `DataFrame`, and perform multiclass classification using `OneVsRest`. The test error is calculated to measure the algorithm's accuracy.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classifier.OneVsRest) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/OneVsRestExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +Refer to the [Java API docs](api/java/org/apache/spark/ml/classification/OneVsRest.html) for more details.
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaOneVsRestExample.java %}
    +</div>
    +</div>
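    +
    +A minimal sketch of the reduction (hypothetical `train`/`test` DataFrames; any binary `Classifier` can serve as the base learner):
    +
    +```scala
    +import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
    +
    +val classifier = new LogisticRegression().setMaxIter(10).setRegParam(0.1)
    +val ovr = new OneVsRest().setClassifier(classifier)
    +val ovrModel = ovr.fit(train)   // fits one binary model per class
    +val predictions = ovrModel.transform(test)
    +```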
    +
    +
    +
    +# Decision trees
    --- End diff --
    
    Done. I put a short paragraph with a link on top of each section about regression/classification of trees/RFs/GBTs




[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-162992990
  
    **[Test build #47358 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47358/consoleFull)** for PR 10207 at commit [`451b773`](https://github.com/apache/spark/commit/451b7737f553fbc425ce2144fe0930b885874c7f).




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163087452
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47006493
  
    --- Diff: docs/_data/menu-ml.yaml ---
    @@ -1,10 +1,10 @@
    -- text: Feature extraction, transformation, and selection
    +- text: "Overview: estimators, transformers and pipelines"
    +  url: ml-intro.html
    +- text: Building and transforming features
    --- End diff --
    
    I like using the keywords "extraction, transformation, and selection" since users may search for those.  "Building" is pretty generic.




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163087316
  
    **[Test build #2187 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2187/consoleFull)** for PR 10207 at commit [`dc584b2`](https://github.com/apache/spark/commit/dc584b26e7c6c9e0bdab4e304377934adc015505).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `[OneVsRest](http://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently.  It is also known as "One-vs-All."`
       * `[Iris dataset](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale), parse it as a DataFrame and perform multiclass classification using `OneVsRest`. The test error is calculated to measure the algorithm accuracy.`
       * `The Pipelines API for Decision Trees offers a bit more functionality than the original API.  In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities).`
       * `* a bit more functionality for random forests: estimates of feature importance, as well as the predicted probability of each class (a.k.a. class conditional probabilities) for classification.`
       * `public class Document implements Serializable `
       * `public class LabeledDocument extends Document implements Serializable `
       * `public class Document implements Serializable `
       * `public class LabeledDocument extends Document implements Serializable `




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by thunterdb <gi...@git.apache.org>.
Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47043656
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -0,0 +1,762 @@
    +---
    +layout: global
    +title: Classification and regression - spark.ml
    +displayTitle: Classification and regression in spark.ml
    +---
    +
    +
    +`\[
    +\newcommand{\R}{\mathbb{R}}
    +\newcommand{\E}{\mathbb{E}}
    +\newcommand{\x}{\mathbf{x}}
    +\newcommand{\y}{\mathbf{y}}
    +\newcommand{\wv}{\mathbf{w}}
    +\newcommand{\av}{\mathbf{\alpha}}
    +\newcommand{\bv}{\mathbf{b}}
    +\newcommand{\N}{\mathbb{N}}
    +\newcommand{\id}{\mathbf{I}}
    +\newcommand{\ind}{\mathbf{1}}
    +\newcommand{\0}{\mathbf{0}}
    +\newcommand{\unit}{\mathbf{e}}
    +\newcommand{\one}{\mathbf{1}}
    +\newcommand{\zero}{\mathbf{0}}
    +\]`
    +
    +**Table of Contents**
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +In MLlib, we implement popular linear methods such as logistic
    +regression and linear least squares with $L_1$ or $L_2$ regularization.
    +Refer to [the linear methods in `spark.mllib`](mllib-linear-methods.html) for
    +details.  In `spark.ml`, we also include a Pipelines API for [Elastic
    +net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
    +of $L_1$ and $L_2$ regularization proposed in [Zou et al., Regularization
    +and variable selection via the elastic
    +net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
    +Mathematically, it is defined as a convex combination of the $L_1$ and
    +the $L_2$ regularization terms:
    +`\[
    +\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
    +\]`
    +By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
    +regularization as special cases. For example, if a [linear
    +regression](https://en.wikipedia.org/wiki/Linear_regression) model is
    +trained with the elastic net parameter $\alpha$ set to $1$, it is
    +equivalent to a
    +[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
    +On the other hand, if $\alpha$ is set to $0$, the trained model reduces
    +to a [ridge
    +regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
    +We implement a Pipelines API for both linear regression and logistic
    +regression with elastic net regularization.
    +
    +
    +# Classification
    +
    +## Logistic regression
    +
    +Logistic regression is a popular method to predict a binary response. It is a special case of [Generalized Linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) that predicts the probability of the outcome.
    +For more background and more details about the implementation, refer to the documentation of the [logistic regression in `spark.mllib`](mllib-linear-methods.html#logistic-regression). 
    +
    +  > The current implementation of logistic regression in `spark.ml` only supports binary classification. Support for multiclass classification will be added in the future.
    +
    +The following example shows how to train a logistic regression model
    +with elastic net regularization. `elasticNetParam` corresponds to
    +$\alpha$ and `regParam` corresponds to $\lambda$.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/logistic_regression_with_elastic_net.py %}
    +</div>
    +
    +</div>
    +
    +The `spark.ml` implementation of logistic regression also supports
    +extracting a summary of the model over the training set. Note that the
    +predictions and metrics, which are stored as a `DataFrame` in
    +`BinaryLogisticRegressionSummary`, are annotated `@transient` and hence
    +only available on the driver.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +
    +[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
    +provides a summary for a
    +[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel).
    +Currently, only binary classification is supported and the
    +summary must be explicitly cast to
    +[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary).
    +This will likely change when multiclass classification is supported.
    +
    +Continuing the earlier example:
    +
    +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionSummaryExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html)
    +provides a summary for a
    +[`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html).
    +Currently, only binary classification is supported and the
    +summary must be explicitly cast to
    +[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html).
    +This will likely change when multiclass classification is supported.
    +
    +Continuing the earlier example:
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionSummaryExample.java %}
    +</div>
    +
    +<!--- TODO: Add python model summaries once implemented -->
    +<div data-lang="python" markdown="1">
    +Logistic regression model summary is not yet supported in Python.
    +</div>
    +
    +</div>
    +
    +
    +## Classification with decision trees
    +
    +Decision trees are a popular family of classification and regression methods.
    +More information about the `spark.ml` implementation can be found further in the [section on decision trees](#decision-trees).
    +
    +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
    +We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier).
    +
    +{% include_example scala/org/apache/spark/examples/ml/DecisionTreeClassificationExample.scala %}
    +
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +More details on parameters can be found in the [Java API documentation](api/java/org/apache/spark/ml/classification/DecisionTreeClassifier.html).
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java %}
    +
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +More details on parameters can be found in the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassifier).
    +
    +{% include_example python/ml/decision_tree_classification_example.py %}
    +
    +</div>
    +
    +</div>
    +
    +## Classification with random forests
    +
    +Random forests are a popular family of classification and regression methods.
    +More information about the `spark.ml` implementation can be found further in the [section on random forests](#random-forests).
    +
    +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
    +We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the tree-based algorithms can recognize.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.RandomForestClassifier) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/RandomForestClassifierExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +Refer to the [Java API docs](api/java/org/apache/spark/ml/classification/RandomForestClassifier.html) for more details.
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaRandomForestClassifierExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classification.RandomForestClassifier) for more details.
    +
    +{% include_example python/ml/random_forest_classifier_example.py %}
    +</div>
    +</div>
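    +
    +A minimal sketch (reusing the hypothetical `indexedLabel`/`indexedFeatures` columns from the decision tree example above):
    +
    +```scala
    +import org.apache.spark.ml.classification.RandomForestClassifier
    +
    +val rf = new RandomForestClassifier()
    +  .setLabelCol("indexedLabel")
    +  .setFeaturesCol("indexedFeatures")
    +  .setNumTrees(10) // more trees reduce variance at additional cost
    +```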
    +
    +## Classification with gradient-boosted trees
    +
    +Gradient-boosted trees (GBTs) are a popular classification and regression method using ensembles of decision trees. 
    +More information about the `spark.ml` implementation can be found further in the [section on GBTs](#gradient-boosted-trees-gbts).
    +
    +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
    +We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the tree-based algorithms can recognize.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.GBTClassifier) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/GradientBoostedTreeClassifierExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +Refer to the [Java API docs](api/java/org/apache/spark/ml/classification/GBTClassifier.html) for more details.
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classification.GBTClassifier) for more details.
    +
    +{% include_example python/ml/gradient_boosted_tree_classifier_example.py %}
    +</div>
    +</div>
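    +
    +A minimal sketch (same hypothetical indexed columns as above):
    +
    +```scala
    +import org.apache.spark.ml.classification.GBTClassifier
    +
    +val gbt = new GBTClassifier()
    +  .setLabelCol("indexedLabel")
    +  .setFeaturesCol("indexedFeatures")
    +  .setMaxIter(10) // number of boosting iterations, i.e. trees in the ensemble
    +```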
    +
    +## Multilayer perceptron classifier
    +
    +Multilayer perceptron classifier (MLPC) is a classifier based on the [feedforward artificial neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network).
    +MLPC consists of multiple layers of nodes.
    +Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs
    +by performing a linear combination of the inputs with the node's weights `$\wv$` and bias `$\bv$` and applying an activation function.
    +It can be written in matrix form for MLPC with `$K+1$` layers as follows:
    +`\[
    +\mathrm{y}(\x) = \mathrm{f_K}(...\mathrm{f_2}(\wv_2^T\mathrm{f_1}(\wv_1^T \x+b_1)+b_2)...+b_K)
    +\]`
    +Nodes in intermediate layers use the sigmoid (logistic) function:
    +`\[
    +\mathrm{f}(z_i) = \frac{1}{1 + e^{-z_i}}
    +\]`
    +Nodes in the output layer use the softmax function:
    +`\[
    +\mathrm{f}(z_i) = \frac{e^{z_i}}{\sum_{k=1}^N e^{z_k}}
    +\]`
    +The number of nodes `$N$` in the output layer corresponds to the number of classes.
    +
    +MLPC employs backpropagation for learning the model. We use the logistic loss function for optimization and L-BFGS as the optimization routine.
    +
    +**Examples**
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/MultilayerPerceptronClassifierExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaMultilayerPerceptronClassifierExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/multilayer_perceptron_classification.py %}
    +</div>
    +
    +</div>
    +
    +
    +## One-vs-Rest classifier (a.k.a. One-vs-All)
    +
    +[OneVsRest](http://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently.  It is also known as "One-vs-All."
    +
    +`OneVsRest` is implemented as an `Estimator`. For the base classifier it takes instances of `Classifier` and creates a binary classification problem for each of the k classes. The classifier for class i is trained to predict whether the label is i or not, distinguishing class i from all other classes.
    +
    +Predictions are made by evaluating each binary classifier, and the index of the most confident classifier is output as the label.
    +
    +### Example
    +
    +The example below demonstrates how to load the
    +[Iris dataset](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale), parse it as a `DataFrame`, and perform multiclass classification using `OneVsRest`. The test error is calculated to measure the algorithm's accuracy.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classifier.OneVsRest) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/OneVsRestExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +Refer to the [Java API docs](api/java/org/apache/spark/ml/classification/OneVsRest.html) for more details.
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaOneVsRestExample.java %}
    +</div>
    +</div>
    +
    +
    +# Regression
    +
    +## Linear regression
    +
    +The interface for working with linear regression models and model
    +summaries is similar to the logistic regression case. The following
    +example demonstrates training an elastic net regularized linear
    +regression model and extracting model summary statistics.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/LinearRegressionWithElasticNetExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +<!--- TODO: Add python model summaries once implemented -->
    +{% include_example python/ml/linear_regression_with_elastic_net.py %}
    +</div>
    +
    +</div>
    +
    +
    +## Regression with decision trees
    +
    +Decision trees are a popular family of classification and regression methods.
    +More information about the `spark.ml` implementation can be found further in the [section on decision trees](#decision-trees).
    +
    +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
    +We use a feature transformer to index categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.regression.DecisionTreeRegressor).
    +
    +{% include_example scala/org/apache/spark/examples/ml/DecisionTreeRegressionExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +More details on parameters can be found in the [Java API documentation](api/java/org/apache/spark/ml/regression/DecisionTreeRegressor.html).
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaDecisionTreeRegressionExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +More details on parameters can be found in the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.regression.DecisionTreeRegressor).
    +
    +{% include_example python/ml/decision_tree_regression_example.py %}
    +</div>
    +
    +</div>
    +
    +
    +## Regression with random forests
    +
    +Random forests are a popular family of classification and regression methods.
    +More information about the `spark.ml` implementation can be found further in the [section on random forests](#random-forests).
    +
    +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the training set, and then evaluate on the held-out test set.
    +We use a feature transformer to index categorical features, adding metadata to the `DataFrame` which the tree-based algorithms can recognize.
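    +
    +The flow mirrors the decision tree sketch above; only the estimator changes. A minimal sketch, reusing the hypothetical `trainingData` with an "indexedFeatures" column:
    +
    +{% highlight scala %}
    +import org.apache.spark.ml.regression.RandomForestRegressor
    +
    +val rf = new RandomForestRegressor()
    +  .setLabelCol("label")
    +  .setFeaturesCol("indexedFeatures")
    +  .setNumTrees(20)
    +
    +val rfModel = rf.fit(trainingData)
    +// Estimated importance of each feature, aggregated over all trees.
    +println(rfModel.featureImportances)
    +{% endhighlight %}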
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.RandomForestRegressor) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/RandomForestRegressorExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +Refer to the [Java API docs](api/java/org/apache/spark/ml/regression/RandomForestRegressor.html) for more details.
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaRandomForestRegressorExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression.RandomForestRegressor) for more details.
    +
    +{% include_example python/ml/random_forest_regressor_example.py %}
    +</div>
    +</div>
    +
    +## Regression with gradient-boosted trees
    +
    +Gradient-boosted trees (GBTs) are a popular regression method using ensembles of decision trees. 
    +More information about the `spark.ml` implementation can be found further in the [section on GBTs](#gradient-boosted-trees-gbts).
    +
    +Note: For this example dataset, `GBTRegressor` actually only needs 1 iteration, but that will not
    +be true in general.
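    +
    +Again only the estimator changes; a minimal sketch, with the same hypothetical `trainingData` as above:
    +
    +{% highlight scala %}
    +import org.apache.spark.ml.regression.GBTRegressor
    +
    +val gbt = new GBTRegressor()
    +  .setLabelCol("label")
    +  .setFeaturesCol("indexedFeatures")
    +  .setMaxIter(10)  // number of boosting iterations
    +
    +val gbtModel = gbt.fit(trainingData)
    +{% endhighlight %}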
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.GBTRegressor) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/GradientBoostedTreeRegressorExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +Refer to the [Java API docs](api/java/org/apache/spark/ml/regression/GBTRegressor.html) for more details.
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaGradientBoostedTreeRegressorExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression.GBTRegressor) for more details.
    +
    +{% include_example python/ml/gradient_boosted_tree_regressor_example.py %}
    +</div>
    +</div>
    +
    +
    +## Survival regression
    +
    +
    +In `spark.ml`, we implement the [Accelerated failure time (AFT)](https://en.wikipedia.org/wiki/Accelerated_failure_time_model)
    +model, which is a parametric survival regression model for censored data.
    +It describes a model for the log of the survival time, so it is often called a
    +log-linear model for survival analysis. Unlike the
    +[Proportional hazards](https://en.wikipedia.org/wiki/Proportional_hazards_model) model
    +designed for the same purpose, the AFT model is easier to parallelize
    +because each instance contributes to the objective function independently.
    +
    +Given the values of the covariates $x^{'}$, for random lifetimes $t_{i}$ of
    +subjects i = 1, ..., n, with possible right-censoring,
    +the likelihood function under the AFT model is given as:
    +`\[
    +L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0}(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0}(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}}
    +\]`
    +where $\delta_{i}$ is the indicator of whether the event has occurred, i.e. whether the observation is uncensored.
    +Using $\epsilon_{i}=\frac{\log{t_{i}}-x^{'}\beta}{\sigma}$, the log-likelihood function
    +assumes the form:
    +`\[
    +\iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+\delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}]
    +\]`
    +where $S_{0}(\epsilon_{i})$ is the baseline survivor function,
    +and $f_{0}(\epsilon_{i})$ is the corresponding density function.
    +
    +The most commonly used AFT model is based on the Weibull distribution of the survival time.
    +The Weibull distribution for the lifetime corresponds to the extreme value distribution for the
    +log of the lifetime, and the $S_{0}(\epsilon)$ function is:
    +`\[   
    +S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}})
    +\]`
    +and the $f_{0}(\epsilon_{i})$ function is:
    +`\[
    +f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}})
    +\]`
    +The log-likelihood function for the AFT model with a Weibull distribution of lifetime is:
    +`\[
    +\iota(\beta,\sigma)= -\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}]
    +\]`
    +Since minimizing the negative log-likelihood is equivalent to maximizing the posterior probability,
    +the loss function we use to optimize is $-\iota(\beta,\sigma)$.
    +The gradient functions for $\beta$ and $\log\sigma$ are, respectively:
    +`\[   
    +\frac{\partial (-\iota)}{\partial \beta}=\sum_{i=1}^{n}[\delta_{i}-e^{\epsilon_{i}}]\frac{x_{i}}{\sigma}
    +\]`
    +`\[ 
    +\frac{\partial (-\iota)}{\partial (\log\sigma)}=\sum_{i=1}^{n}[\delta_{i}+(\delta_{i}-e^{\epsilon_{i}})\epsilon_{i}]
    +\]`
    +
    +The AFT model can be formulated as a convex optimization problem,
    +i.e. the task of finding a minimizer of the convex function $-\iota(\beta,\sigma)$
    +that depends on the coefficients vector $\beta$ and the log of the scale parameter $\log\sigma$.
    +The optimization algorithm underlying the implementation is L-BFGS.
    +The implementation matches the result from R's survival function
    +[survreg](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html).
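    +
    +A minimal Scala sketch of the estimator's surface (the `training` DataFrame is hypothetical; by default the censoring indicator is read from a "censor" column, where 1 means the event occurred, i.e. uncensored, and 0 means censored):
    +
    +{% highlight scala %}
    +import org.apache.spark.ml.regression.AFTSurvivalRegression
    +
    +val aft = new AFTSurvivalRegression()
    +  .setQuantileProbabilities(Array(0.3, 0.6))
    +  .setQuantilesCol("quantiles")
    +
    +val aftModel = aft.fit(training)
    +println(s"Coefficients: ${aftModel.coefficients} Intercept: ${aftModel.intercept} Scale: ${aftModel.scale}")
    +{% endhighlight %}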
    +
    +### Survival regression example
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/AFTSurvivalRegressionExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaAFTSurvivalRegressionExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/aft_survival_regression.py %}
    +</div>
    +
    +</div>
    +
    +
    +
    +# Decision trees
    +
    +[Decision trees](http://en.wikipedia.org/wiki/Decision_tree_learning)
    +and their ensembles are popular methods for the machine learning tasks of
    +classification and regression. Decision trees are widely used since they are easy to interpret,
    +handle categorical features, extend to the multiclass classification setting, do not require
    +feature scaling, and are able to capture non-linearities and feature interactions. Tree ensemble
    +algorithms such as random forests and boosting are among the top performers for classification and
    +regression tasks.
    +
    +MLlib supports decision trees for binary and multiclass classification and for regression,
    +using both continuous and categorical features. The implementation partitions data by rows,
    +allowing distributed training with millions or even billions of instances.
    +
    +Users can find more information about the decision tree algorithm in the [MLlib Decision Tree guide](mllib-decision-tree.html).  In this section, we demonstrate the Pipelines API for Decision Trees.
    +
    +The Pipelines API for Decision Trees offers a bit more functionality than the original API.  In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities).
    +
    +Ensembles of trees (Random Forests and Gradient-Boosted Trees) are described in the [Ensembles guide](ml-ensembles.html).
    +
    +## Inputs and Outputs
    +
    +We list the input and output (prediction) column types here.
    +All output columns are optional; to exclude an output column, set its corresponding Param to an empty string.
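    +
    +For instance (a sketch; `dt` is a hypothetical `DecisionTreeClassifier` instance):
    +
    +{% highlight scala %}
    +// Skip computing the raw prediction column for this estimator.
    +dt.setRawPredictionCol("")
    +{% endhighlight %}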
    +
    +### Input Columns
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th align="left">Param name</th>
    +      <th align="left">Type(s)</th>
    +      <th align="left">Default</th>
    +      <th align="left">Description</th>
    +    </tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>labelCol</td>
    +      <td>Double</td>
    +      <td>"label"</td>
    +      <td>Label to predict</td>
    +    </tr>
    +    <tr>
    +      <td>featuresCol</td>
    +      <td>Vector</td>
    +      <td>"features"</td>
    +      <td>Feature vector</td>
    +    </tr>
    +  </tbody>
    +</table>
    +
    +### Output Columns
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th align="left">Param name</th>
    +      <th align="left">Type(s)</th>
    +      <th align="left">Default</th>
    +      <th align="left">Description</th>
    +      <th align="left">Notes</th>
    +    </tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>predictionCol</td>
    +      <td>Double</td>
    +      <td>"prediction"</td>
    +      <td>Predicted label</td>
    +      <td></td>
    +    </tr>
    +    <tr>
    +      <td>rawPredictionCol</td>
    +      <td>Vector</td>
    +      <td>"rawPrediction"</td>
    +      <td>Vector of length # classes, with the counts of training instance labels at the tree node which makes the prediction</td>
    +      <td>Classification only</td>
    +    </tr>
    +    <tr>
    +      <td>probabilityCol</td>
    +      <td>Vector</td>
    +      <td>"probability"</td>
    +      <td>Vector of length # classes equal to rawPrediction normalized to a multinomial distribution</td>
    +      <td>Classification only</td>
    +    </tr>
    +  </tbody>
    +</table>
    +
    +## Examples
    +
    +The below examples demonstrate the Pipelines API for Decision Trees. The main differences between this API and the [original MLlib Decision Tree API](mllib-decision-tree.html) are:
    --- End diff --
    
    Yes, I missed this paragraph during the copy/paste. I removed this section and moved the explanations about the differences to the main section (`# Ensembles`, ...)




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163037778
  
    **[Test build #47369 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47369/consoleFull)** for PR 10207 at commit [`216acd3`](https://github.com/apache/spark/commit/216acd3a95afafb8c0f410ccdb1fd68c7768c5c8).




[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47006661
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -0,0 +1,733 @@
    +---
    +layout: global
    +title: Classification and regression - spark.ml
    +displayTitle: Classification and regression in spark.ml
    +---
    +
    +
    +`\[
    +\newcommand{\R}{\mathbb{R}}
    +\newcommand{\E}{\mathbb{E}}
    +\newcommand{\x}{\mathbf{x}}
    +\newcommand{\y}{\mathbf{y}}
    +\newcommand{\wv}{\mathbf{w}}
    +\newcommand{\av}{\mathbf{\alpha}}
    +\newcommand{\bv}{\mathbf{b}}
    +\newcommand{\N}{\mathbb{N}}
    +\newcommand{\id}{\mathbf{I}}
    +\newcommand{\ind}{\mathbf{1}}
    +\newcommand{\0}{\mathbf{0}}
    +\newcommand{\unit}{\mathbf{e}}
    +\newcommand{\one}{\mathbf{1}}
    +\newcommand{\zero}{\mathbf{0}}
    +\]`
    +
    +**Table of Contents**
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +In MLlib, we implement popular linear methods such as logistic
    +regression and linear least squares with $L_1$ or $L_2$ regularization.
    +Refer to [the linear methods in mllib](mllib-linear-methods.html) for
    +details.  In `spark.ml`, we also include Pipelines API for [Elastic
    +net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
    +of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
    +and variable selection via the elastic
    +net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
    +Mathematically, it is defined as a convex combination of the $L_1$ and
    +the $L_2$ regularization terms:
    +`\[
    +\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
    +\]`
    +By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
    +regularization as special cases. For example, if a [linear
    +regression](https://en.wikipedia.org/wiki/Linear_regression) model is
    +trained with the elastic net parameter $\alpha$ set to $1$, it is
    +equivalent to a
    +[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
    +On the other hand, if $\alpha$ is set to $0$, the trained model reduces
    +to a [ridge
    +regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
    +We implement Pipelines API for both linear regression and logistic
    +regression with elastic net regularization.
    +
    +# Regression
    +
    +## Linear regression
    +
    +The interface for working with linear regression models and model
    +summaries is similar to the logistic regression case. The following
    +example demonstrates training an elastic net regularized linear
    +regression model and extracting model summary statistics.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/LinearRegressionWithElasticNetExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +<!--- TODO: Add python model summaries once implemented -->
    +{% include_example python/ml/linear_regression_with_elastic_net.py %}
    +</div>
    +
    +</div>
    +
    +## Survival regression
    +
    +
    +In `spark.ml`, we implement the [Accelerated failure time (AFT)](https://en.wikipedia.org/wiki/Accelerated_failure_time_model)
    +model, which is a parametric survival regression model for censored data.
    +It describes a model for the log of the survival time, so it is often called a
    +log-linear model for survival analysis. Unlike the
    +[Proportional hazards](https://en.wikipedia.org/wiki/Proportional_hazards_model) model
    +designed for the same purpose, the AFT model is easier to parallelize
    +because each instance contributes to the objective function independently.
    +
    +Given the values of the covariates $x^{'}$, for random lifetime $t_{i}$ of 
    +subjects i = 1, ..., n, with possible right-censoring, 
    +the likelihood function under the AFT model is given as:
    +`\[
    +L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0}(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0}(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}}
    +\]`
    +where $\delta_{i}$ is the indicator of whether the event has occurred, i.e. whether the observation is uncensored.
    +Using $\epsilon_{i}=\frac{\log{t_{i}}-x^{'}\beta}{\sigma}$, the log-likelihood function
    +assumes the form:
    +`\[
    +\iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+\delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}]
    +\]`
    +where $S_{0}(\epsilon_{i})$ is the baseline survivor function,
    +and $f_{0}(\epsilon_{i})$ is the corresponding density function.
    +
    +The most commonly used AFT model is based on the Weibull distribution of the survival time.
    +The Weibull distribution for the lifetime corresponds to the extreme value distribution for the
    +log of the lifetime, and the $S_{0}(\epsilon)$ function is:
    +`\[   
    +S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}})
    +\]`
    +the $f_{0}(\epsilon_{i})$ function is:
    +`\[
    +f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}})
    +\]`
    +The log-likelihood function for AFT model with Weibull distribution of lifetime is:
    +`\[
    +\iota(\beta,\sigma)= -\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}]
    +\]`
    +Since minimizing the negative log-likelihood is equivalent to maximizing the posterior probability,
    +the loss function we use to optimize is $-\iota(\beta,\sigma)$.
    +The gradient functions for $\beta$ and $\log\sigma$ are, respectively:
    +`\[   
    +\frac{\partial (-\iota)}{\partial \beta}=\sum_{i=1}^{n}[\delta_{i}-e^{\epsilon_{i}}]\frac{x_{i}}{\sigma}
    +\]`
    +`\[ 
    +\frac{\partial (-\iota)}{\partial (\log\sigma)}=\sum_{i=1}^{n}[\delta_{i}+(\delta_{i}-e^{\epsilon_{i}})\epsilon_{i}]
    +\]`
    +
    +The AFT model can be formulated as a convex optimization problem,
    +i.e. the task of finding a minimizer of the convex function $-\iota(\beta,\sigma)$
    +that depends on the coefficients vector $\beta$ and the log of the scale parameter $\log\sigma$.
    +The optimization algorithm underlying the implementation is L-BFGS.
    +The implementation matches the result from R's survival function
    +[survreg](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html).
    +
    +## Example:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/AFTSurvivalRegressionExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaAFTSurvivalRegressionExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/aft_survival_regression.py %}
    +</div>
    +
    +</div>
    +
    +
    +# Classification
    +
    +## Logistic regression
    +
    +The following example shows how to train a logistic regression model
    --- End diff --
    
    This section should say it only applies to binary logreg currently but will support multiclass in the future.




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163087384
  
    **[Test build #47384 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47384/consoleFull)** for PR 10207 at commit [`dc584b2`](https://github.com/apache/spark/commit/dc584b26e7c6c9e0bdab4e304377934adc015505).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `[OneVsRest](http://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently.  It is also known as \"One-vs-All.\"`\n  * `[Iris dataset](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale), parse it as a DataFrame and perform multiclass classification using `OneVsRest`. The test error is calculated to measure the algorithm accuracy.`\n  * `The Pipelines API for Decision Trees offers a bit more functionality than the original API.  In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities).`\n  * `* a bit more functionality for random forests: estimates of feature importance, as well as the predicted probability of each class (a.k.a. class conditional 
 probabilities) for classification.`\n  * `public class Document implements Serializable `\n  * `public class LabeledDocument extends Document implements Serializable `\n  * `public class Document implements Serializable `\n  * `public class LabeledDocument extends Document implements Serializable `\n




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163074701
  
    Thanks!  just a few comments




[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-162996404
  
    **[Test build #47358 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47358/consoleFull)** for PR 10207 at commit [`451b773`](https://github.com/apache/spark/commit/451b7737f553fbc425ce2144fe0930b885874c7f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `[OneVsRest](http://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently.  It is also known as \"One-vs-All.\"`\n  * `[Iris dataset](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale), parse it as a DataFrame and perform multiclass classification using `OneVsRest`. The test error is calculated to measure the algorithm accuracy.`\n  * `The Pipelines API for Decision Trees offers a bit more functionality than the original API.  In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities).`\n  * `* a bit more functionality for random forests: estimates of feature importance, as well as the predicted probability of each class (a.k.a. class conditional 
 probabilities) for classification.`\n  * `public class Document implements Serializable `\n  * `public class LabeledDocument extends Document implements Serializable `\n  * `public class Document implements Serializable `\n  * `public class LabeledDocument extends Document implements Serializable `\n




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47039750
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -0,0 +1,762 @@
    +---
    +layout: global
    +title: Classification and regression - spark.ml
    +displayTitle: Classification and regression in spark.ml
    +---
    +
    +
    +`\[
    +\newcommand{\R}{\mathbb{R}}
    +\newcommand{\E}{\mathbb{E}}
    +\newcommand{\x}{\mathbf{x}}
    +\newcommand{\y}{\mathbf{y}}
    +\newcommand{\wv}{\mathbf{w}}
    +\newcommand{\av}{\mathbf{\alpha}}
    +\newcommand{\bv}{\mathbf{b}}
    +\newcommand{\N}{\mathbb{N}}
    +\newcommand{\id}{\mathbf{I}}
    +\newcommand{\ind}{\mathbf{1}}
    +\newcommand{\0}{\mathbf{0}}
    +\newcommand{\unit}{\mathbf{e}}
    +\newcommand{\one}{\mathbf{1}}
    +\newcommand{\zero}{\mathbf{0}}
    +\]`
    +
    +**Table of Contents**
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +In MLlib, we implement popular linear methods such as logistic
    +regression and linear least squares with $L_1$ or $L_2$ regularization.
    +Refer to [the linear methods in mllib](mllib-linear-methods.html) for
    +details.  In `spark.ml`, we also include Pipelines API for [Elastic
    +net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
    +of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
    +and variable selection via the elastic
    +net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
    +Mathematically, it is defined as a convex combination of the $L_1$ and
    +the $L_2$ regularization terms:
    +`\[
    +\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
    +\]`
    +By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
    +regularization as special cases. For example, if a [linear
    +regression](https://en.wikipedia.org/wiki/Linear_regression) model is
    +trained with the elastic net parameter $\alpha$ set to $1$, it is
    +equivalent to a
    +[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
    +On the other hand, if $\alpha$ is set to $0$, the trained model reduces
    +to a [ridge
    +regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
    +We implement Pipelines API for both linear regression and logistic
    +regression with elastic net regularization.
    +
    +
    +# Classification
    +
    +## Logistic regression
    +
    +Logistic regression is a popular method to predict a binary response. It is a special case of [Generalized Linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) that predicts the probability of the outcome.
    +For more background and more details about the implementation, refer to the documentation of the [logistic regression in `spark.mllib`](mllib-linear-methods.html#logistic-regression). 
    +
    +  > The current implementation of logistic regression in `spark.ml` only supports binary classes. Support for multiclass logistic regression will be added in the future.
    +
    +The following example shows how to train a logistic regression model
    +with elastic net regularization. `elasticNetParam` corresponds to
    +$\alpha$ and `regParam` corresponds to $\lambda$.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/logistic_regression_with_elastic_net.py %}
    +</div>
    +
    +</div>
    +
    +The `spark.ml` implementation of logistic regression also supports
    +extracting a summary of the model over the training set. Note that the
    +predictions and metrics which are stored as `DataFrame` in
    +`BinaryLogisticRegressionSummary` are annotated `@transient` and hence
    +only available on the driver.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +
    +[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
    +provides a summary for a
    +[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel).
    +Currently, only binary classification is supported and the
    +summary must be explicitly cast to
    +[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary).
    +This will likely change when multiclass classification is supported.
    +
    +Continuing the earlier example:
    +
    +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionSummaryExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html)
    +provides a summary for a
    +[`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html).
    +Currently, only binary classification is supported and the
    +summary must be explicitly cast to
    +[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html).
    +This will likely change when multiclass classification is supported.
    +
    +Continuing the earlier example:
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionSummaryExample.java %}
    +</div>
    +
    +<!--- TODO: Add python model summaries once implemented -->
    +<div data-lang="python" markdown="1">
    +Logistic regression model summary is not yet supported in Python.
    +</div>
    +
    +</div>
    +
    +
    +## Classification with decision trees
    --- End diff --
    
    "Decision Tree Classifier" (match method name)




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47039742
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -0,0 +1,762 @@
    +---
    +layout: global
    +title: Classification and regression - spark.ml
    +displayTitle: Classification and regression in spark.ml
    +---
    +
    +
    +`\[
    +\newcommand{\R}{\mathbb{R}}
    +\newcommand{\E}{\mathbb{E}}
    +\newcommand{\x}{\mathbf{x}}
    +\newcommand{\y}{\mathbf{y}}
    +\newcommand{\wv}{\mathbf{w}}
    +\newcommand{\av}{\mathbf{\alpha}}
    +\newcommand{\bv}{\mathbf{b}}
    +\newcommand{\N}{\mathbb{N}}
    +\newcommand{\id}{\mathbf{I}}
    +\newcommand{\ind}{\mathbf{1}}
    +\newcommand{\0}{\mathbf{0}}
    +\newcommand{\unit}{\mathbf{e}}
    +\newcommand{\one}{\mathbf{1}}
    +\newcommand{\zero}{\mathbf{0}}
    +\]`
    +
    +**Table of Contents**
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +In MLlib, we implement popular linear methods such as logistic
    +regression and linear least squares with $L_1$ or $L_2$ regularization.
    +Refer to [the linear methods in mllib](mllib-linear-methods.html) for
    +details.  In `spark.ml`, we also include Pipelines API for [Elastic
    +net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
    +of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
    +and variable selection via the elastic
    +net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
    +Mathematically, it is defined as a convex combination of the $L_1$ and
    +the $L_2$ regularization terms:
    +`\[
    +\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
    +\]`
    +By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
    +regularization as special cases. For example, if a [linear
    +regression](https://en.wikipedia.org/wiki/Linear_regression) model is
    +trained with the elastic net parameter $\alpha$ set to $1$, it is
    +equivalent to a
    +[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
    +On the other hand, if $\alpha$ is set to $0$, the trained model reduces
    +to a [ridge
    +regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
    +We implement Pipelines API for both linear regression and logistic
    +regression with elastic net regularization.
    +
    +
    +# Classification
    +
    +## Logistic regression
    +
    +Logistic regression is a popular method to predict a binary response. It is a special case of [Generalized Linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) that predicts the probability of the outcome.
    +For more background and more details about the implementation, refer to the documentation of the [logistic regression in `spark.mllib`](mllib-linear-methods.html#logistic-regression). 
    +
    +  > The current implementation of logistic regression in `spark.ml` only supports binary classes. Support for multiclass logistic regression will be added in the future.
    +
    --- End diff --
    
    Use "Example" heading




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163044753
  
    **[Test build #47369 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47369/consoleFull)** for PR 10207 at commit [`216acd3`](https://github.com/apache/spark/commit/216acd3a95afafb8c0f410ccdb1fd68c7768c5c8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `[OneVsRest](http://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently.  It is also known as \"One-vs-All.\"`\n  * `[Iris dataset](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale), parse it as a DataFrame and perform multiclass classification using `OneVsRest`. The test error is calculated to measure the algorithm accuracy.`\n  * `The Pipelines API for Decision Trees offers a bit more functionality than the original API.  In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities).`\n  * `* a bit more functionality for random forests: estimates of feature importance, as well as the predicted probability of each class (a.k.a. class conditional 
 probabilities) for classification.`\n  * `public class Document implements Serializable `\n  * `public class LabeledDocument extends Document implements Serializable `\n  * `public class Document implements Serializable `\n  * `public class LabeledDocument extends Document implements Serializable `\n




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163037641
  
    **[Test build #47367 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47367/consoleFull)** for PR 10207 at commit [`6c7850b`](https://github.com/apache/spark/commit/6c7850b5e655dba617092b703fb35077fb9cb339).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `[OneVsRest](http://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently.  It is also known as \"One-vs-All.\"`\n  * `[Iris dataset](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale), parse it as a DataFrame and perform multiclass classification using `OneVsRest`. The test error is calculated to measure the algorithm accuracy.`\n  * `The Pipelines API for Decision Trees offers a bit more functionality than the original API.  In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities).`\n  * `* a bit more functionality for random forests: estimates of feature importance, as well as the predicted probability of each class (a.k.a. class conditional 
 probabilities) for classification.`\n  * `public class Document implements Serializable `\n  * `public class LabeledDocument extends Document implements Serializable `\n  * `public class Document implements Serializable `\n  * `public class LabeledDocument extends Document implements Serializable `\n




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163045054
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47006503
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -0,0 +1,733 @@
    +---
    +layout: global
    +title: Classification and regression - spark.ml
    +displayTitle: Classification and regression in spark.ml
    +---
    +
    +
    +`\[
    +\newcommand{\R}{\mathbb{R}}
    +\newcommand{\E}{\mathbb{E}}
    +\newcommand{\x}{\mathbf{x}}
    +\newcommand{\y}{\mathbf{y}}
    +\newcommand{\wv}{\mathbf{w}}
    +\newcommand{\av}{\mathbf{\alpha}}
    +\newcommand{\bv}{\mathbf{b}}
    +\newcommand{\N}{\mathbb{N}}
    +\newcommand{\id}{\mathbf{I}}
    +\newcommand{\ind}{\mathbf{1}}
    +\newcommand{\0}{\mathbf{0}}
    +\newcommand{\unit}{\mathbf{e}}
    +\newcommand{\one}{\mathbf{1}}
    +\newcommand{\zero}{\mathbf{0}}
    +\]`
    +
    +**Table of Contents**
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +In MLlib, we implement popular linear methods such as logistic
    +regression and linear least squares with $L_1$ or $L_2$ regularization.
    +Refer to [the linear methods in mllib](mllib-linear-methods.html) for
    +details.  In `spark.ml`, we also include Pipelines API for [Elastic
    +net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
    +of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
    +and variable selection via the elastic
    +net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
    +Mathematically, it is defined as a convex combination of the $L_1$ and
    +the $L_2$ regularization terms:
    +`\[
    +\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
    +\]`
    +By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
    +regularization as special cases. For example, if a [linear
    +regression](https://en.wikipedia.org/wiki/Linear_regression) model is
    +trained with the elastic net parameter $\alpha$ set to $1$, it is
    +equivalent to a
    +[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
    +On the other hand, if $\alpha$ is set to $0$, the trained model reduces
    +to a [ridge
    +regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
    +We implement Pipelines API for both linear regression and logistic
    +regression with elastic net regularization.
    +
    +# Regression
    --- End diff --
    
    I'd put Classification before Regression since I'd guess Classification is more commonly used.




[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-162994801
  
    I'd use the "[ML]" tag in the PR title.




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by thunterdb <gi...@git.apache.org>.
Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47017809
  
    --- Diff: docs/mllib-guide.md ---
    @@ -66,15 +66,18 @@ We list major functionality from both below, with links to detailed guides.
     
     # spark.ml: high-level APIs for ML pipelines
     
    -**[spark.ml programming guide](ml-guide.html)** provides an overview of the Pipelines API and major
    -concepts. It also contains sections on using algorithms within the Pipelines API, for example:
    -
    -* [Feature extraction, transformation, and selection](ml-features.html)
    +* [Overview: estimators, transformers and pipelines](ml-intro.html)
    +* [Building and transforming features](ml-features.html)
    +* [Classification and regression](ml-classification-regression.html)
     * [Clustering](ml-clustering.html)
    -* [Decision trees for classification and regression](ml-decision-tree.html)
    -* [Ensembles](ml-ensembles.html)
    -* [Linear methods with elastic net regularization](ml-linear-methods.html)
    -* [Multilayer perceptron classifier](ml-ann.html)
    +* [Advanced topics](ml-advanced.html)
    +
    +Some techniques are not available yet in spark.ml, most notably:
    + - clustering
    --- End diff --
    
    Oh yes true, of course. I rephrased it.




[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47006653
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -0,0 +1,733 @@
    +---
    +layout: global
    +title: Classification and regression - spark.ml
    +displayTitle: Classification and regression in spark.ml
    +---
    +
    +
    +`\[
    +\newcommand{\R}{\mathbb{R}}
    +\newcommand{\E}{\mathbb{E}}
    +\newcommand{\x}{\mathbf{x}}
    +\newcommand{\y}{\mathbf{y}}
    +\newcommand{\wv}{\mathbf{w}}
    +\newcommand{\av}{\mathbf{\alpha}}
    +\newcommand{\bv}{\mathbf{b}}
    +\newcommand{\N}{\mathbb{N}}
    +\newcommand{\id}{\mathbf{I}}
    +\newcommand{\ind}{\mathbf{1}}
    +\newcommand{\0}{\mathbf{0}}
    +\newcommand{\unit}{\mathbf{e}}
    +\newcommand{\one}{\mathbf{1}}
    +\newcommand{\zero}{\mathbf{0}}
    +\]`
    +
    +**Table of Contents**
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +In MLlib, we implement popular linear methods such as logistic
    +regression and linear least squares with $L_1$ or $L_2$ regularization.
    +Refer to [the linear methods in mllib](mllib-linear-methods.html) for
    +details.  In `spark.ml`, we also include Pipelines API for [Elastic
    +net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
    +of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
    +and variable selection via the elastic
    +net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
    +Mathematically, it is defined as a convex combination of the $L_1$ and
    +the $L_2$ regularization terms:
    +`\[
    +\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
    +\]`
    +By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
    +regularization as special cases. For example, if a [linear
    +regression](https://en.wikipedia.org/wiki/Linear_regression) model is
    +trained with the elastic net parameter $\alpha$ set to $1$, it is
    +equivalent to a
    +[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
    +On the other hand, if $\alpha$ is set to $0$, the trained model reduces
    +to a [ridge
    +regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
    +We implement Pipelines API for both linear regression and logistic
    +regression with elastic net regularization.
    +
    +# Regression
    +
    +## Linear regression
    +
    +The interface for working with linear regression models and model
    +summaries is similar to the logistic regression case. The following
    +example demonstrates training an elastic net regularized linear
    +regression model and extracting model summary statistics.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/LinearRegressionWithElasticNetExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +<!--- TODO: Add python model summaries once implemented -->
    +{% include_example python/ml/linear_regression_with_elastic_net.py %}
    +</div>
    +
    +</div>
    +
    +## Survival regression
    +
    +
    +In `spark.ml`, we implement the [Accelerated failure time (AFT)](https://en.wikipedia.org/wiki/Accelerated_failure_time_model)
    +model, which is a parametric survival regression model for censored data.
    +It describes a model for the log of the survival time, so it is often called a
    +log-linear model for survival analysis. Unlike the
    +[Proportional hazards](https://en.wikipedia.org/wiki/Proportional_hazards_model) model
    +designed for the same purpose, the AFT model is easier to parallelize
    +because each instance contributes to the objective function independently.
    +
    +Given the values of the covariates $x^{'}$, for random lifetime $t_{i}$ of 
    +subjects i = 1, ..., n, with possible right-censoring, 
    +the likelihood function under the AFT model is given as:
    +`\[
    +L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0}(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0}(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}}
    +\]`
    +where $\delta_{i}$ is the indicator of whether the event has occurred, i.e. whether the observation is uncensored.
    +Using $\epsilon_{i}=\frac{\log{t_{i}}-x^{'}\beta}{\sigma}$, the log-likelihood function
    +assumes the form:
    +`\[
    +\iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+\delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}]
    +\]`
    +where $S_{0}(\epsilon_{i})$ is the baseline survivor function,
    +and $f_{0}(\epsilon_{i})$ is the corresponding density function.
    +
    +The most commonly used AFT model is based on the Weibull distribution of the survival time.
    +The Weibull distribution for the lifetime corresponds to the extreme value distribution for the
    +log of the lifetime, and the $S_{0}(\epsilon)$ function is:
    +`\[   
    +S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}})
    +\]`
    +the $f_{0}(\epsilon_{i})$ function is:
    +`\[
    +f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}})
    +\]`
    +The log-likelihood function for AFT model with Weibull distribution of lifetime is:
    +`\[
    +\iota(\beta,\sigma)= -\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}]
    +\]`
    +Since minimizing the negative log-likelihood is equivalent to maximizing the posterior probability,
    +the loss function we use to optimize is $-\iota(\beta,\sigma)$.
    +The gradient functions for $\beta$ and $\log\sigma$ are, respectively:
    +`\[   
    +\frac{\partial (-\iota)}{\partial \beta}=\sum_{i=1}^{n}[\delta_{i}-e^{\epsilon_{i}}]\frac{x_{i}}{\sigma}
    +\]`
    +`\[ 
    +\frac{\partial (-\iota)}{\partial (\log\sigma)}=\sum_{i=1}^{n}[\delta_{i}+(\delta_{i}-e^{\epsilon_{i}})\epsilon_{i}]
    +\]`
    +
    +The AFT model can be formulated as a convex optimization problem,
    +i.e. the task of finding a minimizer of the convex function $-\iota(\beta,\sigma)$
    +that depends on the coefficients vector $\beta$ and the log of the scale parameter $\log\sigma$.
    +The optimization algorithm underlying the implementation is L-BFGS.
    +The implementation matches the result from R's survival function
    +[survreg](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html).
    +
    +## Example:
    --- End diff --
    
    Use a standard heading for example sections.  I'd suggest a smaller heading, with no colon unless there is text following.  (I like seeing examples in the TOC.)




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by thunterdb <gi...@git.apache.org>.
Github user thunterdb commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163009350
  
    @jkbradley no this PR just moves the text around, with little modification. More substantital changes will be done later.




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163085255
  
    **[Test build #2187 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2187/consoleFull)** for PR 10207 at commit [`dc584b2`](https://github.com/apache/spark/commit/dc584b26e7c6c9e0bdab4e304377934adc015505).




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47039793
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -0,0 +1,762 @@
    +---
    +layout: global
    +title: Classification and regression - spark.ml
    +displayTitle: Classification and regression in spark.ml
    +---
    +
    +
    +`\[
    +\newcommand{\R}{\mathbb{R}}
    +\newcommand{\E}{\mathbb{E}}
    +\newcommand{\x}{\mathbf{x}}
    +\newcommand{\y}{\mathbf{y}}
    +\newcommand{\wv}{\mathbf{w}}
    +\newcommand{\av}{\mathbf{\alpha}}
    +\newcommand{\bv}{\mathbf{b}}
    +\newcommand{\N}{\mathbb{N}}
    +\newcommand{\id}{\mathbf{I}}
    +\newcommand{\ind}{\mathbf{1}}
    +\newcommand{\0}{\mathbf{0}}
    +\newcommand{\unit}{\mathbf{e}}
    +\newcommand{\one}{\mathbf{1}}
    +\newcommand{\zero}{\mathbf{0}}
    +\]`
    +
    +**Table of Contents**
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +In MLlib, we implement popular linear methods such as logistic
    +regression and linear least squares with $L_1$ or $L_2$ regularization.
    +Refer to [the linear methods in mllib](mllib-linear-methods.html) for
    +details.  In `spark.ml`, we also include the Pipelines API for [Elastic
    +net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
    +of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
    +and variable selection via the elastic
    +net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
    +Mathematically, it is defined as a convex combination of the $L_1$ and
    +the $L_2$ regularization terms:
    +`\[
    +\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
    +\]`
    +By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
    +regularization as special cases. For example, if a [linear
    +regression](https://en.wikipedia.org/wiki/Linear_regression) model is
    +trained with the elastic net parameter $\alpha$ set to $1$, it is
    +equivalent to a
    +[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
    +On the other hand, if $\alpha$ is set to $0$, the trained model reduces
    +to a [ridge
    +regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
    +We implement the Pipelines API for both linear regression and logistic
    +regression with elastic net regularization.
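    +
    +For instance, the two special cases above correspond to parameter settings like
    +the following minimal sketch (assuming the `spark.ml` 1.6 API, where
    +`elasticNetParam` is $\alpha$ and `regParam` is $\lambda$):
    +
    +```scala
    +import org.apache.spark.ml.regression.LinearRegression
    +
    +// alpha = 1: pure L1 penalty, i.e. a Lasso model
    +val lasso = new LinearRegression().setElasticNetParam(1.0).setRegParam(0.3)
    +// alpha = 0: pure L2 penalty, i.e. a ridge regression model
    +val ridge = new LinearRegression().setElasticNetParam(0.0).setRegParam(0.3)
    +```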
    +
    +
    +# Classification
    +
    +## Logistic regression
    +
    +Logistic regression is a popular method to predict a binary response. It is a special case of [Generalized Linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) that predicts the probability of the outcome.
    +For more background and more details about the implementation, refer to the documentation of the [logistic regression in `spark.mllib`](mllib-linear-methods.html#logistic-regression). 
    +
    +  > The current implementation of logistic regression in `spark.ml` only supports binary classification. Support for multiclass classification will be added in the future.
    +
    +The following example shows how to train a logistic regression model
    +with elastic net regularization. `elasticNetParam` corresponds to
    +$\alpha$ and `regParam` corresponds to $\lambda$.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/logistic_regression_with_elastic_net.py %}
    +</div>
    +
    +</div>
    +
    +The `spark.ml` implementation of logistic regression also supports
    +extracting a summary of the model over the training set. Note that the
    +predictions and metrics which are stored as `DataFrame` in
    +`BinaryLogisticRegressionSummary` are annotated `@transient` and hence
    +only available on the driver.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +
    +[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
    +provides a summary for a
    +[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel).
    +Currently, only binary classification is supported and the
    +summary must be explicitly cast to
    +[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary).
    +This will likely change when multiclass classification is supported.
    +
    +Continuing the earlier example:
    +
    +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionSummaryExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html)
    +provides a summary for a
    +[`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html).
    +Currently, only binary classification is supported and the
    +summary must be explicitly cast to
    +[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html).
    +This will likely change when multiclass classification is supported.
    +
    +Continuing the earlier example:
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionSummaryExample.java %}
    +</div>
    +
    +<!--- TODO: Add python model summaries once implemented -->
    +<div data-lang="python" markdown="1">
    +Logistic regression model summary is not yet supported in Python.
    +</div>
    +
    +</div>
    +
    +
    +## Classification with decision trees
    +
    +Decision trees are a popular family of classification and regression methods.
    +More information about the `spark.ml` implementation can be found further in the [section on decision trees](#decision-trees).
    +
    +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
    +We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier).
    +
    +{% include_example scala/org/apache/spark/examples/ml/DecisionTreeClassificationExample.scala %}
    +
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +More details on parameters can be found in the [Java API documentation](api/java/org/apache/spark/ml/classification/DecisionTreeClassifier.html).
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java %}
    +
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +More details on parameters can be found in the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassifier).
    +
    +{% include_example python/ml/decision_tree_classification_example.py %}
    +
    +</div>
    +
    +</div>
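    +
    +The two indexing transformers mentioned above are typically composed with the
    +classifier in a `Pipeline`, roughly as in this sketch (assuming the `spark.ml`
    +1.6 API; column names follow the bundled examples):
    +
    +```scala
    +import org.apache.spark.ml.Pipeline
    +import org.apache.spark.ml.classification.DecisionTreeClassifier
    +import org.apache.spark.ml.feature.{StringIndexer, VectorIndexer}
    +
    +// index string labels, and index categorical features in the feature vector
    +val labelIndexer = new StringIndexer()
    +  .setInputCol("label").setOutputCol("indexedLabel")
    +val featureIndexer = new VectorIndexer()
    +  .setInputCol("features").setOutputCol("indexedFeatures")
    +  .setMaxCategories(4)  // features with > 4 distinct values are treated as continuous
    +val dt = new DecisionTreeClassifier()
    +  .setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures")
    +val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, dt))
    +```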
    +
    +## Classification with random forests
    +
    +Random forests are a popular family of classification and regression methods.
    +More information about the `spark.ml` implementation can be found further in the [section on random forests](#random-forests).
    +
    +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
    +We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the tree-based algorithms can recognize.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.RandomForestClassifier) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/RandomForestClassifierExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +Refer to the [Java API docs](api/java/org/apache/spark/ml/classification/RandomForestClassifier.html) for more details.
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaRandomForestClassifierExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classification.RandomForestClassifier) for more details.
    +
    +{% include_example python/ml/random_forest_classifier_example.py %}
    +</div>
    +</div>
    +
    +## Classification with gradient-boosted trees
    +
    +Gradient-boosted trees (GBTs) are a popular classification and regression method using ensembles of decision trees. 
    +More information about the `spark.ml` implementation can be found further in the [section on GBTs](#gradient-boosted-trees-gbts).
    +
    +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
    +We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the tree-based algorithms can recognize.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.GBTClassifier) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/GradientBoostedTreeClassifierExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +Refer to the [Java API docs](api/java/org/apache/spark/ml/classification/GBTClassifier.html) for more details.
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classification.GBTClassifier) for more details.
    +
    +{% include_example python/ml/gradient_boosted_tree_classifier_example.py %}
    +</div>
    +</div>
    +
    +## Multilayer perceptron classifier
    +
    +Multilayer perceptron classifier (MLPC) is a classifier based on the [feedforward artificial neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network). 
    +MLPC consists of multiple layers of nodes. 
    +Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs
    +by forming a linear combination of the inputs with the node's weights `$\wv$` and bias `$\bv$` and applying an activation function.
    +This can be written in matrix form for MLPC with `$K+1$` layers as follows:
    +`\[
    +\mathrm{y}(\x) = \mathrm{f_K}(...\mathrm{f_2}(\wv_2^T\mathrm{f_1}(\wv_1^T \x+b_1)+b_2)...+b_K)
    +\]`
    +Nodes in intermediate layers use the sigmoid (logistic) function:
    +`\[
    +\mathrm{f}(z_i) = \frac{1}{1 + e^{-z_i}}
    +\]`
    +Nodes in the output layer use the softmax function:
    +`\[
    +\mathrm{f}(z_i) = \frac{e^{z_i}}{\sum_{k=1}^N e^{z_k}}
    +\]`
    +The number of nodes `$N$` in the output layer corresponds to the number of classes. 
    +
    +MLPC employs backpropagation for learning the model. We use the logistic loss function for optimization and L-BFGS as the optimization routine.
    +
    +**Examples**
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/MultilayerPerceptronClassifierExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaMultilayerPerceptronClassifierExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/multilayer_perceptron_classification.py %}
    +</div>
    +
    +</div>
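    +
    +As a condensed illustration of how the layers are specified, a minimal Scala
    +sketch (assuming the `spark.ml` 1.6 API and the sample dataset shipped with
    +Spark):
    +
    +```scala
    +import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
    +
    +// input layer of size 4 (features), two hidden layers of sizes 5 and 4,
    +// and an output layer of size 3 (classes)
    +val layers = Array[Int](4, 5, 4, 3)
    +val trainer = new MultilayerPerceptronClassifier()
    +  .setLayers(layers)
    +  .setBlockSize(128)
    +  .setSeed(1234L)
    +  .setMaxIter(100)
    +val data = sqlContext.read.format("libsvm")
    +  .load("data/mllib/sample_multiclass_classification_data.txt")
    +val model = trainer.fit(data)
    +```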
    +
    +
    +## One-vs-Rest classifier (a.k.a. One-vs-All)
    +
    +[OneVsRest](http://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently.  It is also known as "One-vs-All."
    +
    +`OneVsRest` is implemented as an `Estimator`. For the base classifier it takes instances of `Classifier` and creates a binary classification problem for each of the k classes. The classifier for class i is trained to predict whether the label is i or not, distinguishing class i from all other classes.
    +
    +Predictions are made by evaluating each binary classifier, and the index of the most confident classifier is output as the label.
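    +
    +In code, this composition might look like the following minimal sketch
    +(assuming the `spark.ml` 1.6 API; `train` is a hypothetical `DataFrame` with
    +"label" and "features" columns):
    +
    +```scala
    +import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
    +
    +// the base binary classifier, trained once per class
    +val classifier = new LogisticRegression().setMaxIter(10).setTol(1e-6)
    +val ovr = new OneVsRest().setClassifier(classifier)
    +val ovrModel = ovr.fit(train)
    +val predictions = ovrModel.transform(train)
    +```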
    +
    +### Example
    +
    +The example below demonstrates how to load the
    +[Iris dataset](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale), parse it as a DataFrame and perform multiclass classification using `OneVsRest`. The test error is calculated to measure the algorithm accuracy.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classifier.OneVsRest) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/OneVsRestExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +Refer to the [Java API docs](api/java/org/apache/spark/ml/classification/OneVsRest.html) for more details.
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaOneVsRestExample.java %}
    +</div>
    +</div>
    +
    +
    +# Regression
    +
    +## Linear regression
    +
    +The interface for working with linear regression models and model
    +summaries is similar to the logistic regression case. The following
    +example demonstrates training an elastic net regularized linear
    +regression model and extracting model summary statistics.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/LinearRegressionWithElasticNetExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +<!--- TODO: Add python model summaries once implemented -->
    +{% include_example python/ml/linear_regression_with_elastic_net.py %}
    +</div>
    +
    +</div>
    +
    +
    +## Regression with decision trees
    +
    +Decision trees are a popular family of classification and regression methods.
    +More information about the `spark.ml` implementation can be found further in the [section on decision trees](#decision-trees).
    +
    +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
    +We use a feature transformer to index categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.regression.DecisionTreeRegressor).
    +
    +{% include_example scala/org/apache/spark/examples/ml/DecisionTreeRegressionExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +More details on parameters can be found in the [Java API documentation](api/java/org/apache/spark/ml/regression/DecisionTreeRegressor.html).
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaDecisionTreeRegressionExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +More details on parameters can be found in the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.regression.DecisionTreeRegressor).
    +
    +{% include_example python/ml/decision_tree_regression_example.py %}
    +</div>
    +
    +</div>
    +
    +
    +## Regression with random forests
    +
    +Random forests are a popular family of classification and regression methods.
    +More information about the `spark.ml` implementation can be found further in the [section on random forests](#random-forests).
    +
    +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
    +We use a feature transformer to index categorical features, adding metadata to the `DataFrame` which the tree-based algorithms can recognize.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.RandomForestRegressor) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/RandomForestRegressorExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +Refer to the [Java API docs](api/java/org/apache/spark/ml/regression/RandomForestRegressor.html) for more details.
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaRandomForestRegressorExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression.RandomForestRegressor) for more details.
    +
    +{% include_example python/ml/random_forest_regressor_example.py %}
    +</div>
    +</div>
    +
    +## Regression with gradient-boosted trees
    +
    +Gradient-boosted trees (GBTs) are a popular regression method using ensembles of decision trees. 
    +More information about the `spark.ml` implementation can be found further in the [section on GBTs](#gradient-boosted-trees-gbts).
    +
    +Note: For this example dataset, `GBTRegressor` actually only needs 1 iteration, but that will not
    +be true in general.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.GBTRegressor) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/GradientBoostedTreeRegressorExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +Refer to the [Java API docs](api/java/org/apache/spark/ml/regression/GBTRegressor.html) for more details.
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaGradientBoostedTreeRegressorExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression.GBTRegressor) for more details.
    +
    +{% include_example python/ml/gradient_boosted_tree_regressor_example.py %}
    +</div>
    +</div>
    +
    +
    +## Survival regression
    +
    +
    +In `spark.ml`, we implement the [Accelerated failure time (AFT)](https://en.wikipedia.org/wiki/Accelerated_failure_time_model)
    +model, which is a parametric survival regression model for censored data.
    +It describes a model for the log of the survival time, so it is often called a
    +log-linear model for survival analysis. Unlike the
    +[Proportional hazards](https://en.wikipedia.org/wiki/Proportional_hazards_model) model
    +designed for the same purpose, the AFT model is easier to parallelize
    +because each instance contributes to the objective function independently.
    +
    +Given the values of the covariates $x^{'}$, for random lifetime $t_{i}$ of 
    +subjects $i = 1, \ldots, n$, with possible right-censoring,
    +the likelihood function under the AFT model is given as:
    +`\[
    +L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0}(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0}(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}}
    +\]`
    +Where $\delta_{i}$ is the indicator of whether the event has occurred, i.e. whether the observation is uncensored.
    +Using $\epsilon_{i}=\frac{\log{t_{i}}-x^{'}\beta}{\sigma}$, the log-likelihood function
    +assumes the form:
    +`\[
    +\iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+\delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}]
    +\]`
    +Where $S_{0}(\epsilon_{i})$ is the baseline survivor function,
    +and $f_{0}(\epsilon_{i})$ is the corresponding density function.
    +
    +The most commonly used AFT model is based on the Weibull distribution of the survival time.
    +The Weibull distribution for the lifetime corresponds to the extreme value distribution for
    +the log of the lifetime, and the $S_{0}(\epsilon_{i})$ function is:
    +`\[   
    +S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}})
    +\]`
    +and the $f_{0}(\epsilon_{i})$ function is:
    +`\[
    +f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}})
    +\]`
    +The log-likelihood function for the AFT model with the Weibull distribution of lifetime is:
    +`\[
    +\iota(\beta,\sigma)= -\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}]
    +\]`
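    +This follows by substituting $\log{f_{0}}(\epsilon_{i})=\epsilon_{i}-e^{\epsilon_{i}}$
    +and $\log{S_{0}}(\epsilon_{i})=-e^{\epsilon_{i}}$ into the general log-likelihood above,
    +since the $\delta_{i}e^{\epsilon_{i}}$ terms cancel:
    +`\[
    +\sum_{i=1}^{n}[-\delta_{i}\log\sigma+\delta_{i}(\epsilon_{i}-e^{\epsilon_{i}})-(1-\delta_{i})e^{\epsilon_{i}}]=-\sum_{i=1}^{n}[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}]
    +\]`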
    +Since minimizing the negative log-likelihood is equivalent to maximizing the a posteriori probability,
    +the loss function we optimize is $-\iota(\beta,\sigma)$.
    +The gradient functions for $\beta$ and $\log\sigma$ respectively are:
    +`\[   
    +\frac{\partial (-\iota)}{\partial \beta}=\sum_{i=1}^{n}[\delta_{i}-e^{\epsilon_{i}}]\frac{x_{i}}{\sigma}
    +\]`
    +`\[ 
    +\frac{\partial (-\iota)}{\partial (\log\sigma)}=\sum_{i=1}^{n}[\delta_{i}+(\delta_{i}-e^{\epsilon_{i}})\epsilon_{i}]
    +\]`
    +
    +The AFT model can be formulated as a convex optimization problem,
    +i.e. the task of finding a minimizer of a convex function $-\iota(\beta,\sigma)$
    +that depends on the coefficients vector $\beta$ and the log of the scale parameter $\log\sigma$.
    +The optimization algorithm underlying the implementation is L-BFGS.
    +The implementation matches the result from R's survival function
    +[survreg](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html).
    +
    +### Survival regression example
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/AFTSurvivalRegressionExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaAFTSurvivalRegressionExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/aft_survival_regression.py %}
    +</div>
    +
    +</div>
    +
    +
    +
    +# Decision trees
    +
    +[Decision trees](http://en.wikipedia.org/wiki/Decision_tree_learning)
    +and their ensembles are popular methods for the machine learning tasks of
    +classification and regression. Decision trees are widely used since they are easy to interpret,
    +handle categorical features, extend to the multiclass classification setting, do not require
    +feature scaling, and are able to capture non-linearities and feature interactions. Tree ensemble
    +algorithms such as random forests and boosting are among the top performers for classification and
    +regression tasks.
    +
    +MLlib supports decision trees for binary and multiclass classification and for regression,
    +using both continuous and categorical features. The implementation partitions data by rows,
    +allowing distributed training with millions or even billions of instances.
    +
    +Users can find more information about the decision tree algorithm in the [MLlib Decision Tree guide](mllib-decision-tree.html).  In this section, we demonstrate the Pipelines API for Decision Trees.
    +
    +The Pipelines API for Decision Trees offers a bit more functionality than the original API.  In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities).
    +
    +Ensembles of trees (Random Forests and Gradient-Boosted Trees) are described in the [Ensembles guide](ml-ensembles.html).
    +
    +## Inputs and Outputs
    +
    +We list the input and output (prediction) column types here.
    +All output columns are optional; to exclude an output column, set its corresponding Param to an empty string.
    +
    +### Input Columns
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th align="left">Param name</th>
    +      <th align="left">Type(s)</th>
    +      <th align="left">Default</th>
    +      <th align="left">Description</th>
    +    </tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>labelCol</td>
    +      <td>Double</td>
    +      <td>"label"</td>
    +      <td>Label to predict</td>
    +    </tr>
    +    <tr>
    +      <td>featuresCol</td>
    +      <td>Vector</td>
    +      <td>"features"</td>
    +      <td>Feature vector</td>
    +    </tr>
    +  </tbody>
    +</table>
    +
    +### Output Columns
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th align="left">Param name</th>
    +      <th align="left">Type(s)</th>
    +      <th align="left">Default</th>
    +      <th align="left">Description</th>
    +      <th align="left">Notes</th>
    +    </tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>predictionCol</td>
    +      <td>Double</td>
    +      <td>"prediction"</td>
    +      <td>Predicted label</td>
    +      <td></td>
    +    </tr>
    +    <tr>
    +      <td>rawPredictionCol</td>
    +      <td>Vector</td>
    +      <td>"rawPrediction"</td>
    +      <td>Vector of length # classes, with the counts of training instance labels at the tree node which makes the prediction</td>
    +      <td>Classification only</td>
    +    </tr>
    +    <tr>
    +      <td>probabilityCol</td>
    +      <td>Vector</td>
    +      <td>"probability"</td>
    +      <td>Vector of length # classes equal to rawPrediction normalized to a multinomial distribution</td>
    +      <td>Classification only</td>
    +    </tr>
    +  </tbody>
    +</table>
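    +
    +For example, to omit the probability output column described above, its Param
    +can be set to the empty string (a sketch, assuming the `spark.ml` API):
    +
    +```scala
    +import org.apache.spark.ml.classification.DecisionTreeClassifier
    +
    +val dt = new DecisionTreeClassifier()
    +  .setLabelCol("label")
    +  .setFeaturesCol("features")
    +  .setProbabilityCol("")  // empty string: do not produce this output column
    +```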
    +
    +## Examples
    +
    +The examples below demonstrate the Pipelines API for Decision Trees. The main differences between this API and the [original MLlib Decision Tree API](mllib-decision-tree.html) are:
    --- End diff --
    
    Need to update.  I won't comment on other text, but this one didn't make sense anymore.




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163088112
  
    Merging with master and branch-1.6




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47039765
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -0,0 +1,762 @@
    +---
    +layout: global
    +title: Classification and regression - spark.ml
    +displayTitle: Classification and regression in spark.ml
    +---
    +
    +
    +`\[
    +\newcommand{\R}{\mathbb{R}}
    +\newcommand{\E}{\mathbb{E}}
    +\newcommand{\x}{\mathbf{x}}
    +\newcommand{\y}{\mathbf{y}}
    +\newcommand{\wv}{\mathbf{w}}
    +\newcommand{\av}{\mathbf{\alpha}}
    +\newcommand{\bv}{\mathbf{b}}
    +\newcommand{\N}{\mathbb{N}}
    +\newcommand{\id}{\mathbf{I}}
    +\newcommand{\ind}{\mathbf{1}}
    +\newcommand{\0}{\mathbf{0}}
    +\newcommand{\unit}{\mathbf{e}}
    +\newcommand{\one}{\mathbf{1}}
    +\newcommand{\zero}{\mathbf{0}}
    +\]`
    +
    +**Table of Contents**
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +In MLlib, we implement popular linear methods such as logistic
    +regression and linear least squares with $L_1$ or $L_2$ regularization.
    +Refer to [the linear methods in mllib](mllib-linear-methods.html) for
    +details.  In `spark.ml`, we also include the Pipelines API for [Elastic
    +net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
    +of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
    +and variable selection via the elastic
    +net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
    +Mathematically, it is defined as a convex combination of the $L_1$ and
    +the $L_2$ regularization terms:
    +`\[
    +\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
    +\]`
    +By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
    +regularization as special cases. For example, if a [linear
    +regression](https://en.wikipedia.org/wiki/Linear_regression) model is
    +trained with the elastic net parameter $\alpha$ set to $1$, it is
    +equivalent to a
    +[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
    +On the other hand, if $\alpha$ is set to $0$, the trained model reduces
    +to a [ridge
    +regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
    +We implement the Pipelines API for both linear regression and logistic
    +regression with elastic net regularization.
    +
    +
    +# Classification
    +
    +## Logistic regression
    +
    +Logistic regression is a popular method to predict a binary response. It is a special case of [Generalized Linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) that predicts the probability of the outcome.
    +For more background and more details about the implementation, refer to the documentation of the [logistic regression in `spark.mllib`](mllib-linear-methods.html#logistic-regression). 
    +
    +  > The current implementation of logistic regression in `spark.ml` only supports binary classification. Support for multiclass classification will be added in the future.
    +
    +The following example shows how to train a logistic regression model
    +with elastic net regularization. `elasticNetParam` corresponds to
    +$\alpha$ and `regParam` corresponds to $\lambda$.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/logistic_regression_with_elastic_net.py %}
    +</div>
    +
    +</div>
    +
    +The `spark.ml` implementation of logistic regression also supports
    +extracting a summary of the model over the training set. Note that the
    +predictions and metrics which are stored as `DataFrame` in
    +`BinaryLogisticRegressionSummary` are annotated `@transient` and hence
    +only available on the driver.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +
    +[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
    +provides a summary for a
    +[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel).
    +Currently, only binary classification is supported and the
    +summary must be explicitly cast to
    +[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary).
    +This will likely change when multiclass classification is supported.
    +
    +Continuing the earlier example:
    +
    +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionSummaryExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html)
    +provides a summary for a
    +[`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html).
    +Currently, only binary classification is supported and the
    +summary must be explicitly cast to
    +[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html).
    +This will likely change when multiclass classification is supported.
    +
    +Continuing the earlier example:
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionSummaryExample.java %}
    +</div>
    +
    +<!--- TODO: Add python model summaries once implemented -->
    +<div data-lang="python" markdown="1">
    +Logistic regression model summary is not yet supported in Python.
    +</div>
    +
    +</div>
    +
    +
    +## Classification with decision trees
    +
    +Decision trees are a popular family of classification and regression methods.
    +More information about the `spark.ml` implementation can be found further in the [section on decision trees](#decision-trees).
    +
    +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
    +We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier).
    +
    +{% include_example scala/org/apache/spark/examples/ml/DecisionTreeClassificationExample.scala %}
    +
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +More details on parameters can be found in the [Java API documentation](api/java/org/apache/spark/ml/classification/DecisionTreeClassifier.html).
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java %}
    +
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +More details on parameters can be found in the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassifier).
    +
    +{% include_example python/ml/decision_tree_classification_example.py %}
    +
    +</div>
    +
    +</div>
    +
    +## Classification with random forests
    +
    +Random forests are a popular family of classification and regression methods.
    +More information about the `spark.ml` implementation can be found further in the [section on random forests](#random-forests).
    +
    +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
    +We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the tree-based algorithms can recognize.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.RandomForestClassifier) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/RandomForestClassifierExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +Refer to the [Java API docs](api/java/org/apache/spark/ml/classification/RandomForestClassifier.html) for more details.
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaRandomForestClassifierExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classification.RandomForestClassifier) for more details.
    +
    +{% include_example python/ml/random_forest_classifier_example.py %}
    +</div>
    +</div>
    +
    +## Classification with gradient-boosted trees
    +
    +Gradient-boosted trees (GBTs) are a popular classification and regression method using ensembles of decision trees. 
    +More information about the `spark.ml` implementation can be found further in the [section on GBTs](#gradient-boosted-trees-gbts).
    +
    +The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
    +We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the tree-based algorithms can recognize.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.GBTClassifier) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/GradientBoostedTreeClassifierExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +Refer to the [Java API docs](api/java/org/apache/spark/ml/classification/GBTClassifier.html) for more details.
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaGradientBoostedTreeClassifierExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classification.GBTClassifier) for more details.
    +
    +{% include_example python/ml/gradient_boosted_tree_classifier_example.py %}
    +</div>
    +</div>
    +
    +## Multilayer perceptron classifier
    +
    +Multilayer perceptron classifier (MLPC) is a classifier based on the [feedforward artificial neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network). 
    +MLPC consists of multiple layers of nodes. 
    +Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs
    +by forming a linear combination of the inputs with the node's weights `$\wv$` and bias `$\bv$` and applying an activation function.
    +This can be written in matrix form for MLPC with `$K+1$` layers as follows:
    +`\[
    +\mathrm{y}(\x) = \mathrm{f_K}(...\mathrm{f_2}(\wv_2^T\mathrm{f_1}(\wv_1^T \x+b_1)+b_2)...+b_K)
    +\]`
    +Nodes in intermediate layers use the sigmoid (logistic) function:
    +`\[
    +\mathrm{f}(z_i) = \frac{1}{1 + e^{-z_i}}
    +\]`
    +Nodes in the output layer use the softmax function:
    +`\[
    +\mathrm{f}(z_i) = \frac{e^{z_i}}{\sum_{k=1}^N e^{z_k}}
    +\]`
    +The number of nodes `$N$` in the output layer corresponds to the number of classes. 
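    +For example, for raw node outputs $z=(1,2,3)$ the softmax yields approximately
    +$(0.090, 0.245, 0.665)$, which sums to one; the predicted class is the one with
    +the largest value.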
    +
    +MLPC employs backpropagation for learning the model. We use the logistic loss function for optimization and L-BFGS as the optimization routine.
    +
    +**Examples**
    --- End diff --
    
    use header




[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163002290
  
    Those are my high-level comments.  Have you rewritten much text?  If so, I can do a second more detailed pass after updates (which will require restructuring).
    
    Thanks!




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163085251
  
    **[Test build #47384 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47384/consoleFull)** for PR 10207 at commit [`dc584b2`](https://github.com/apache/spark/commit/dc584b26e7c6c9e0bdab4e304377934adc015505).




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163045057
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47369/
    Test PASSed.




[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-162995378
  
    Remove the empty "ml-pipelines.md" file?




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10207




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by thunterdb <gi...@git.apache.org>.
Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47018213
  
    --- Diff: docs/_data/menu-ml.yaml ---
    @@ -1,10 +1,10 @@
    -- text: Feature extraction, transformation, and selection
    +- text: "Overview: estimators, transformers and pipelines"
    +  url: ml-intro.html
    +- text: Building and transforming features
    --- End diff --
    
    Done




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163013229
  
    OK thanks just confirming




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163037823
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47367/
    Test PASSed.




[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47006683
  
    --- Diff: docs/mllib-guide.md ---
    @@ -66,15 +66,18 @@ We list major functionality from both below, with links to detailed guides.
     
     # spark.ml: high-level APIs for ML pipelines
     
    -**[spark.ml programming guide](ml-guide.html)** provides an overview of the Pipelines API and major
    -concepts. It also contains sections on using algorithms within the Pipelines API, for example:
    -
    -* [Feature extraction, transformation, and selection](ml-features.html)
    +* [Overview: estimators, transformers and pipelines](ml-intro.html)
    +* [Building and transforming features](ml-features.html)
    +* [Classification and regression](ml-classification-regression.html)
     * [Clustering](ml-clustering.html)
    -* [Decision trees for classification and regression](ml-decision-tree.html)
    -* [Ensembles](ml-ensembles.html)
    -* [Linear methods with elastic net regularization](ml-linear-methods.html)
    -* [Multilayer perceptron classifier](ml-ann.html)
    +* [Advanced topics](ml-advanced.html)
    +
    +Some techniques are not available yet in spark.ml, most notably:
    + - clustering
    --- End diff --
    
    remove this & next line (no longer true)




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163087454
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47384/
    Test PASSed.




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163034712
  
    **[Test build #47367 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47367/consoleFull)** for PR 10207 at commit [`6c7850b`](https://github.com/apache/spark/commit/6c7850b5e655dba617092b703fb35077fb9cb339).




[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-162996585
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47358/
    Test PASSed.




[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47006676
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -0,0 +1,733 @@
    +---
    +layout: global
    +title: Classification and regression - spark.ml
    +displayTitle: Classification and regression in spark.ml
    +---
    +
    +
    +`\[
    +\newcommand{\R}{\mathbb{R}}
    +\newcommand{\E}{\mathbb{E}}
    +\newcommand{\x}{\mathbf{x}}
    +\newcommand{\y}{\mathbf{y}}
    +\newcommand{\wv}{\mathbf{w}}
    +\newcommand{\av}{\mathbf{\alpha}}
    +\newcommand{\bv}{\mathbf{b}}
    +\newcommand{\N}{\mathbb{N}}
    +\newcommand{\id}{\mathbf{I}}
    +\newcommand{\ind}{\mathbf{1}}
    +\newcommand{\0}{\mathbf{0}}
    +\newcommand{\unit}{\mathbf{e}}
    +\newcommand{\one}{\mathbf{1}}
    +\newcommand{\zero}{\mathbf{0}}
    +\]`
    +
    +**Table of Contents**
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +In MLlib, we implement popular linear methods such as logistic
    +regression and linear least squares with $L_1$ or $L_2$ regularization.
    +Refer to [the linear methods in mllib](mllib-linear-methods.html) for
    +details.  In `spark.ml`, we also include the Pipelines API for [Elastic
    +net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
    +of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
    +and variable selection via the elastic
    +net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
    +Mathematically, it is defined as a convex combination of the $L_1$ and
    +the $L_2$ regularization terms:
    +`\[
    +\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
    +\]`
    +By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
    +regularization as special cases. For example, if a [linear
    +regression](https://en.wikipedia.org/wiki/Linear_regression) model is
    +trained with the elastic net parameter $\alpha$ set to $1$, it is
    +equivalent to a
    +[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
    +On the other hand, if $\alpha$ is set to $0$, the trained model reduces
    +to a [ridge
    +regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
    +We implement the Pipelines API for both linear regression and logistic
    +regression with elastic net regularization.
    +
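    +As a quick illustration (a minimal sketch, separate from the full examples
    +referenced below), the mixing parameter $\alpha$ and the regularization
    +strength $\lambda$ correspond to the `elasticNetParam` and `regParam`
    +parameters of the estimators:
    +
    +{% highlight scala %}
    +import org.apache.spark.ml.regression.LinearRegression
    +
    +// alpha = 0.5 gives an equal mix of L1 and L2; lambda = 0.3 is the overall strength.
    +val lr = new LinearRegression()
    +  .setElasticNetParam(0.5) // alpha in the formula above
    +  .setRegParam(0.3)        // lambda in the formula above
    +{% endhighlight %}
    +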
    +# Regression
    +
    +## Linear regression
    +
    +The interface for working with linear regression models and model
    +summaries is similar to the logistic regression case. The following
    +example demonstrates training an elastic net regularized linear
    +regression model and extracting model summary statistics.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/LinearRegressionWithElasticNetExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +<!--- TODO: Add python model summaries once implemented -->
    +{% include_example python/ml/linear_regression_with_elastic_net.py %}
    +</div>
    +
    +</div>
    +
    +## Survival regression
    +
    +
    +In `spark.ml`, we implement the [Accelerated failure time (AFT)](https://en.wikipedia.org/wiki/Accelerated_failure_time_model)
    +model, which is a parametric survival regression model for censored data.
    +It describes a model for the log of the survival time, so it is often called a
    +log-linear model for survival analysis. Unlike a
    +[Proportional hazards](https://en.wikipedia.org/wiki/Proportional_hazards_model) model
    +designed for the same purpose, the AFT model is easier to parallelize
    +because each instance contributes to the objective function independently.
    +
    +Given the values of the covariates $x_{i}^{'}$, for random lifetimes $t_{i}$ of
    +subjects $i = 1, \ldots, n$, with possible right-censoring,
    +the likelihood function under the AFT model is given as:
    +`\[
    +L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0}(\frac{\log{t_{i}}-x_{i}^{'}\beta}{\sigma})]^{\delta_{i}}S_{0}(\frac{\log{t_{i}}-x_{i}^{'}\beta}{\sigma})^{1-\delta_{i}}
    +\]`
    +where $\delta_{i}$ is the indicator of whether the event has occurred, i.e. whether the observation is uncensored.
    +Using $\epsilon_{i}=\frac{\log{t_{i}}-x_{i}^{'}\beta}{\sigma}$, the log-likelihood function
    +assumes the form:
    +`\[
    +\iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+\delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}]
    +\]`
    +where $S_{0}(\epsilon_{i})$ is the baseline survivor function
    +and $f_{0}(\epsilon_{i})$ is the corresponding density function.
    +
    +The most commonly used AFT model is based on the Weibull distribution of the survival time.
    +A Weibull distribution for the lifetime corresponds to an extreme value distribution for the
    +log of the lifetime, and the $S_{0}(\epsilon_{i})$ function is:
    +`\[
    +S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}})
    +\]`
    +and the $f_{0}(\epsilon_{i})$ function is:
    +`\[
    +f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}})
    +\]`
    +The log-likelihood function for the AFT model with a Weibull distribution of the lifetime is:
    +`\[
    +\iota(\beta,\sigma)= -\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}]
    +\]`
    +Since minimizing the negative log-likelihood is equivalent to maximizing the likelihood,
    +the loss function we optimize is $-\iota(\beta,\sigma)$.
    +The gradient functions for $\beta$ and $\log\sigma$ respectively are:
    +`\[
    +\frac{\partial (-\iota)}{\partial \beta}=\sum_{i=1}^{n}[\delta_{i}-e^{\epsilon_{i}}]\frac{x_{i}}{\sigma}
    +\]`
    +`\[
    +\frac{\partial (-\iota)}{\partial (\log\sigma)}=\sum_{i=1}^{n}[\delta_{i}+(\delta_{i}-e^{\epsilon_{i}})\epsilon_{i}]
    +\]`
    +
    +The AFT model can be formulated as a convex optimization problem,
    +i.e. the task of finding a minimizer of the convex function $-\iota(\beta,\sigma)$
    +that depends on the coefficients vector $\beta$ and the log of the scale parameter $\log\sigma$.
    +The optimization algorithm underlying the implementation is L-BFGS.
    +The implementation matches the result from R's survival function
    +[survreg](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html).
    +
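    +As a minimal sketch of the interface (assuming a training `DataFrame` named
    +`training` with `label`, `censor`, and `features` columns), the censoring
    +indicator is supplied through a dedicated column:
    +
    +{% highlight scala %}
    +import org.apache.spark.ml.regression.AFTSurvivalRegression
    +
    +val aft = new AFTSurvivalRegression()
    +  .setLabelCol("label")   // observed lifetime (possibly censored)
    +  .setCensorCol("censor") // 1.0 = event occurred, 0.0 = right-censored
    +val model = aft.fit(training)
    +{% endhighlight %}
    +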
    +## Example
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/AFTSurvivalRegressionExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaAFTSurvivalRegressionExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/aft_survival_regression.py %}
    +</div>
    +
    +</div>
    +
    +
    +# Classification
    +
    +## Logistic regression
    +
    +The following example shows how to train a logistic regression model
    +with elastic net regularization. `elasticNetParam` corresponds to
    +$\alpha$ and `regParam` corresponds to $\lambda$.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/logistic_regression_with_elastic_net.py %}
    +</div>
    +
    +</div>
    +
    +The `spark.ml` implementation of logistic regression also supports
    +extracting a summary of the model over the training set. Note that the
    +predictions and metrics, which are stored as a `DataFrame` in
    +`BinaryLogisticRegressionSummary`, are annotated `@transient` and hence
    +are only available on the driver.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +
    +[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
    +provides a summary for a
    +[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel).
    +Currently, only binary classification is supported and the
    +summary must be explicitly cast to
    +[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary).
    +This will likely change when multiclass classification is supported.
    +
    +Continuing the earlier example:
    +
    +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionSummaryExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html)
    +provides a summary for a
    +[`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html).
    +Currently, only binary classification is supported and the
    +summary must be explicitly cast to
    +[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html).
    +This will likely change when multiclass classification is supported.
    +
    +Continuing the earlier example:
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionSummaryExample.java %}
    +</div>
    +
    +<!--- TODO: Add python model summaries once implemented -->
    +<div data-lang="python" markdown="1">
    +Logistic regression model summary is not yet supported in Python.
    +</div>
    +
    +</div>
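    +
    +As a condensed sketch of the cast described above (assuming `lrModel` is a
    +fitted `LogisticRegressionModel` from the previous example):
    +
    +{% highlight scala %}
    +import org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary
    +
    +// The summary is computed over the training set; cast it to access
    +// binary-specific metrics such as areaUnderROC.
    +val trainingSummary = lrModel.summary
    +val binarySummary = trainingSummary.asInstanceOf[BinaryLogisticRegressionTrainingSummary]
    +println(s"areaUnderROC: ${binarySummary.areaUnderROC}")
    +{% endhighlight %}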
    +
    +
    +## Multilayer perceptron classifier
    +
    +Multilayer perceptron classifier (MLPC) is a classifier based on the [feedforward artificial neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network). 
    +MLPC consists of multiple layers of nodes. 
    +Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs 
    +by performing a linear combination of the inputs with the node's weights `$\wv$` and bias `$\bv$` and applying an activation function. 
    +This can be written in matrix form for MLPC with `$K+1$` layers as follows:
    +`\[
    +\mathrm{y}(\x) = \mathrm{f_K}(...\mathrm{f_2}(\wv_2^T\mathrm{f_1}(\wv_1^T \x+b_1)+b_2)...+b_K)
    +\]`
    +Nodes in the intermediate layers use the sigmoid (logistic) function:
    +`\[
    +\mathrm{f}(z_i) = \frac{1}{1 + e^{-z_i}}
    +\]`
    +Nodes in the output layer use the softmax function:
    +`\[
    +\mathrm{f}(z_i) = \frac{e^{z_i}}{\sum_{k=1}^N e^{z_k}}
    +\]`
    +The number of nodes `$N$` in the output layer corresponds to the number of classes. 
    +
    +MLPC employs backpropagation for learning the model. We use the logistic loss function for optimization and L-BFGS as the optimization routine.
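    +
    +For intuition, the sizes of all layers, from input to output, are specified
    +by a single `layers` parameter. A minimal sketch (the layer sizes here are
    +illustrative, chosen for a 4-feature, 3-class problem):
    +
    +{% highlight scala %}
    +import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
    +
    +val mlp = new MultilayerPerceptronClassifier()
    +  .setLayers(Array(4, 5, 4, 3)) // 4 inputs, two hidden layers, 3 output classes
    +  .setMaxIter(100)
    +{% endhighlight %}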
    +
    +**Examples**
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example scala/org/apache/spark/examples/ml/MultilayerPerceptronClassifierExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% include_example java/org/apache/spark/examples/ml/JavaMultilayerPerceptronClassifierExample.java %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% include_example python/ml/multilayer_perceptron_classification.py %}
    +</div>
    +
    +</div>
    +
    +
    +## One-vs-Rest classifier (a.k.a. One-vs-All)
    +
    +[OneVsRest](http://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently.  It is also known as "One-vs-All."
    +
    +`OneVsRest` is implemented as an `Estimator`. For the base classifier it takes instances of `Classifier` and creates a binary classification problem for each of the $k$ classes. The classifier for class $i$ is trained to predict whether the label is $i$ or not, distinguishing class $i$ from all other classes.
    +
    +Predictions are made by evaluating each binary classifier, and the index of the most confident classifier is output as the label. A sketch of the setup follows below.
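    +
    +A minimal sketch of wiring up the reduction (the choice of base classifier
    +and its settings here are illustrative):
    +
    +{% highlight scala %}
    +import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
    +
    +// Any spark.ml Classifier can serve as the base binary classifier.
    +val ovr = new OneVsRest()
    +  .setClassifier(new LogisticRegression().setMaxIter(10))
    +{% endhighlight %}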
    +
    +### Example
    +
    +The example below demonstrates how to load the
    +[Iris dataset](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale), parse it into a DataFrame, and perform multiclass classification using `OneVsRest`. The test error is computed to measure the algorithm's accuracy.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.OneVsRest) for more details.
    +
    +{% include_example scala/org/apache/spark/examples/ml/OneVsRestExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +Refer to the [Java API docs](api/java/org/apache/spark/ml/classification/OneVsRest.html) for more details.
    +
    +{% include_example java/org/apache/spark/examples/ml/JavaOneVsRestExample.java %}
    +</div>
    +</div>
    +
    +
    +
    +# Decision trees
    --- End diff --
    
    I'd prefer to split trees and ensembles into subsections of classification & regression.  General info about trees and ensembles could be put into a separate section, with links to it from the classification & regression subsections.




[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10207#issuecomment-163085055
  
    Thanks for updating it!  LGTM pending tests.




[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10207#discussion_r47006668
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -0,0 +1,733 @@
    +The `spark.ml` implementation of logistic regression also supports
    +extracting a summary of the model over the training set. Note that the
    +predictions and metrics, which are stored as a `DataFrame` in
    +`BinaryLogisticRegressionSummary`, are annotated `@transient` and hence
    +are only available on the driver.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +
    +[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
    --- End diff --
    
    I'd start with a link to the LogisticRegression docs.  Then you can list these less important links.

