Posted to reviews@spark.apache.org by atalwalkar <gi...@git.apache.org> on 2014/08/12 20:53:04 UTC

[GitHub] spark pull request: SPARK-2830

GitHub user atalwalkar opened a pull request:

    https://github.com/apache/spark/pull/1908

    SPARK-2830

    As per discussions with Xiangrui, I've reorganized and edited the mllib documentation.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/atalwalkar/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1908.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1908
    
----
commit 7ec366ae35ce30e1ebe700068066a09636b68c92
Author: Ameet Talwalkar <at...@gmail.com>
Date:   2014-08-12T18:48:58Z

    reorganize and edit mllib documentation

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1908#discussion_r16143682
  
    --- Diff: docs/mllib-guide.md ---
    @@ -3,18 +3,19 @@ layout: global
     title: Machine Learning Library (MLlib)
     ---
     
    -MLlib is a Spark implementation of some common machine learning algorithms and utilities,
    +MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities,
     including classification, regression, clustering, collaborative
    -filtering, dimensionality reduction, as well as underlying optimization primitives:
    +filtering, dimensionality reduction, as well as underlying optimization primitives, as outlined below:
     
    -* [Basics](mllib-basics.html)
    -  * data types 
    +* [Data Types](mllib-basics.html)
    --- End diff --
    
    `Data Types` -> `Data types` (to be consistent with others)




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1908#issuecomment-51988081
  
    QA tests have started for PR 1908. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18390/consoleFull




[GitHub] spark pull request: SPARK-2830

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1908#issuecomment-51960126
  
    QA tests have started for PR 1908. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18381/consoleFull




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1908#discussion_r16147722
  
    --- Diff: docs/mllib-guide.md ---
    @@ -23,17 +24,18 @@ filtering, dimensionality reduction, as well as underlying optimization primitiv
     * [Dimensionality reduction](mllib-dimensionality-reduction.html)
       * singular value decomposition (SVD)
       * principal component analysis (PCA)
    -* [Optimization](mllib-optimization.html)
    +* [Feature extraction](mllib-feature-extraction.html)
    --- End diff --
    
    fixed.




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1908#discussion_r16148842
  
    --- Diff: docs/mllib-linear-methods.md ---
    @@ -33,24 +33,23 @@ the task of finding a minimizer of a convex function `$f$` that depends on a var
     Formally, we can write this as the optimization problem `$\min_{\wv \in\R^d} \; f(\wv)$`, where
     the objective function is of the form
     `\begin{equation}
    -    f(\wv) := 
    -    \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i) +
    -    \lambda\, R(\wv_i)
    +    f(\wv) := \lambda\, R(\wv) +
    +    \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i)
         \label{eq:regPrimal}
         \ .
     \end{equation}`
     Here the vectors `$\x_i\in\R^d$` are the training data examples, for `$1\le i\le n$`, and
     `$y_i\in\R$` are their corresponding labels, which we want to predict. 
     We call the method *linear* if $L(\wv; \x, y)$ can be expressed as a function of $\wv^T x$ and $y$.
    -Several MLlib's classification and regression algorithms fall into this category,
    +Several of MLlib's classification and regression algorithms fall into this category,
     and are discussed here.
     
     The objective function `$f$` has two parts:
    -the loss that measures the error of the model on the training data, 
    -and the regularizer that measures the complexity of the model.
    -The loss function `$L(\wv;.)$` must be a convex function in `$\wv$`.
    +the regularizer that controls the complexity of the model,
    +and the loss that measures the error of the model on the training data.
    +The loss function `$L(\wv;.)$` is typically a convex function in `$\wv$`.
     The fixed regularization parameter `$\lambda \ge 0$` (`regParam` in the code) defines the trade-off
    -between the two goals of small loss and small model complexity.
    +between the two goals of minimizing the loss (i.e., training error) and minimizing model complexity (i.e., to avoid overfitting).
    --- End diff --
    
    I fixed this particular example, though there are many other instances of long lines.  Perhaps I can create a separate PR that addresses this issue in all the mllib markdown files...
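
The regularized objective quoted in the diff above, f(w) = lambda * R(w) + (1/n) * sum_i L(w; x_i, y_i), can be illustrated with a small sketch. This is not MLlib code: the function names are hypothetical, and the choice of hinge loss (labels in {-1, +1}) and of R(w) = 1/2 * ||w||^2 is an assumption made only to have something concrete to evaluate.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def hinge_loss(w, x, y):
    # Assumed loss for illustration: hinge loss with label y in {-1, +1}.
    return max(0.0, 1.0 - y * dot(w, x))


def l2_regularizer(w):
    # Assumed regularizer for illustration: R(w) = 1/2 * ||w||_2^2.
    return 0.5 * dot(w, w)


def objective(w, xs, ys, reg_param):
    # f(w) = reg_param * R(w) + (1/n) * sum_i L(w; x_i, y_i)
    n = len(xs)
    avg_loss = sum(hinge_loss(w, x, y) for x, y in zip(xs, ys)) / n
    return reg_param * l2_regularizer(w) + avg_loss
```

Raising `reg_param` puts more weight on the regularizer relative to the average training loss, which is the trade-off the edited paragraph describes.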




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1908#issuecomment-51992216
  
    QA tests have started for PR 1908. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18394/consoleFull




[GitHub] spark pull request: SPARK-2830

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1908#issuecomment-51967141
  
    QA results for PR 1908:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18381/consoleFull




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1908#discussion_r16143689
  
    --- Diff: docs/mllib-linear-methods.md ---
    @@ -33,24 +33,23 @@ the task of finding a minimizer of a convex function `$f$` that depends on a var
     Formally, we can write this as the optimization problem `$\min_{\wv \in\R^d} \; f(\wv)$`, where
     the objective function is of the form
     `\begin{equation}
    -    f(\wv) := 
    -    \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i) +
    -    \lambda\, R(\wv_i)
    +    f(\wv) := \lambda\, R(\wv) +
    +    \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i)
         \label{eq:regPrimal}
         \ .
     \end{equation}`
     Here the vectors `$\x_i\in\R^d$` are the training data examples, for `$1\le i\le n$`, and
     `$y_i\in\R$` are their corresponding labels, which we want to predict. 
     We call the method *linear* if $L(\wv; \x, y)$ can be expressed as a function of $\wv^T x$ and $y$.
    -Several MLlib's classification and regression algorithms fall into this category,
    +Several of MLlib's classification and regression algorithms fall into this category,
     and are discussed here.
     
     The objective function `$f$` has two parts:
    -the loss that measures the error of the model on the training data, 
    -and the regularizer that measures the complexity of the model.
    -The loss function `$L(\wv;.)$` must be a convex function in `$\wv$`.
    +the regularizer that controls the complexity of the model,
    +and the loss that measures the error of the model on the training data.
    +The loss function `$L(\wv;.)$` is typically a convex function in `$\wv$`.
     The fixed regularization parameter `$\lambda \ge 0$` (`regParam` in the code) defines the trade-off
    -between the two goals of small loss and small model complexity.
    +between the two goals of minimizing the loss (i.e., training error) and minimizing model complexity (i.e., to avoid overfitting).
    --- End diff --
    
    minor: This line is too long. We don't have a style guide for markdown files. Usually I try to keep each line to at most 100 characters and avoid putting more than one sentence on the same line.
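
The 100-character convention mentioned in this comment could be checked mechanically. The following is a minimal sketch, not part of the Spark build; `find_long_lines` is a hypothetical helper name, and the default limit of 100 is taken from the comment rather than from any official style guide.

```python
def find_long_lines(text, limit=100):
    """Return (line_number, length) pairs for lines longer than `limit`.

    Line numbers are 1-based, matching what editors and diffs display.
    """
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]
```

Run over the contents of a markdown file read into a string, this lists the lines a follow-up cleanup would need to wrap. It does not check the second part of the convention (one sentence per line).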




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1908#discussion_r16149982
  
    --- Diff: docs/mllib-linear-methods.md ---
    @@ -106,27 +105,25 @@ Here `$\mathrm{sign}(\wv)$` is the vector consisting of the signs (`$\pm1$`) of
     of `$\wv$`.
     
     L2-regularized problems are generally easier to solve than L1-regularized due to smoothness.
    -However, L1 regularization can help promote sparsity in weights, leading to simpler models, which is
    -also used for feature selection.  It is not recommended to train models without any regularization,
    +However, L1 regularization can help promote sparsity in weights leading to smaller and more interpretable models, the latter of which can be useful for feature selection.
    +It is not recommended to train models without any regularization,
     especially when the number of training examples is small.
     
     ## Binary classification
     
    -[Binary classification](http://en.wikipedia.org/wiki/Binary_classification) is to divide items into
    +[Binary classification](http://en.wikipedia.org/wiki/Binary_classification) aims to divide items into
     two categories: positive and negative.  MLlib supports two linear methods for binary classification:
    -linear support vector machine (SVM) and logistic regression.  The training data set is represented
    +linear support vector machines (SVMs) and logistic regression.  The training data set is represented
    --- End diff --
    
    Just added a sentence to this effect.




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1908#issuecomment-51995210
  
    LGTM. I'm merging this into both master and branch-1.1, so people can help improve individual sections. Thanks!!




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1908#discussion_r16143707
  
    --- Diff: docs/mllib-linear-methods.md ---
    @@ -134,39 +131,42 @@ By default, linear SVMs are trained with an L2 regularization.
     We also support alternative L1 regularization. In this case,
     the problem becomes a [linear program](http://en.wikipedia.org/wiki/Linear_programming).
     
    -Linear SVM algorithm outputs a SVM model, which makes predictions based on the value of $\wv^T \x$.
    -By the default, if $\wv^T \x \geq 0$, the outcome is positive, or negative otherwise.
    -However, quite often in practice, the default threshold $0$ is not a good choice.
    -The threshold should be determined via model evaluation.
    +The linear SVMs algorithm outputs an SVMs model. Given a new data point, denoted by $\x$, the model makes predictions based on the value of $\wv^T \x$.
    --- End diff --
    
    `an SVMs model` -> `an SVM model`?
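
The prediction rule described in the quoted diff (compare w^T x against a threshold, with 0 as the default) can be sketched as follows. This is an illustration only, not MLlib's API: `predict` and its `threshold` parameter are hypothetical names.

```python
def predict(w, x, threshold=0.0):
    """Return True for the positive class when w^T x >= threshold."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return score >= threshold
```

As the removed lines of the diff note, the default threshold of 0 is often not a good choice in practice, so the threshold would normally be chosen through model evaluation rather than left at its default.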




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1908




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1908#discussion_r16143685
  
    --- Diff: docs/mllib-guide.md ---
    @@ -23,17 +24,18 @@ filtering, dimensionality reduction, as well as underlying optimization primitiv
     * [Dimensionality reduction](mllib-dimensionality-reduction.html)
       * singular value decomposition (SVD)
       * principal component analysis (PCA)
    -* [Optimization](mllib-optimization.html)
    +* [Feature extraction](mllib-feature-extraction.html)
    --- End diff --
    
    `Feature extraction` -> `Feature extraction and transformation`?




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1908#discussion_r16149641
  
    --- Diff: docs/mllib-linear-methods.md ---
    @@ -106,27 +105,25 @@ Here `$\mathrm{sign}(\wv)$` is the vector consisting of the signs (`$\pm1$`) of
     of `$\wv$`.
     
     L2-regularized problems are generally easier to solve than L1-regularized due to smoothness.
    -However, L1 regularization can help promote sparsity in weights, leading to simpler models, which is
    -also used for feature selection.  It is not recommended to train models without any regularization,
    +However, L1 regularization can help promote sparsity in weights leading to smaller and more interpretable models, the latter of which can be useful for feature selection.
    +It is not recommended to train models without any regularization,
     especially when the number of training examples is small.
     
     ## Binary classification
     
    -[Binary classification](http://en.wikipedia.org/wiki/Binary_classification) is to divide items into
    +[Binary classification](http://en.wikipedia.org/wiki/Binary_classification) aims to divide items into
     two categories: positive and negative.  MLlib supports two linear methods for binary classification:
    -linear support vector machine (SVM) and logistic regression.  The training data set is represented
    +linear support vector machines (SVMs) and logistic regression.  The training data set is represented
    --- End diff --
    
    That's an interesting point. For linear regression, people created new names for different types of regularization, some of which even shadowed the original name. It would be nice to add a sentence to clarify that we are counting different regularization types.




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1908#discussion_r16147703
  
    --- Diff: docs/mllib-guide.md ---
    @@ -3,18 +3,19 @@ layout: global
     title: Machine Learning Library (MLlib)
     ---
     
    -MLlib is a Spark implementation of some common machine learning algorithms and utilities,
    +MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities,
     including classification, regression, clustering, collaborative
    -filtering, dimensionality reduction, as well as underlying optimization primitives:
    +filtering, dimensionality reduction, as well as underlying optimization primitives, as outlined below:
     
    -* [Basics](mllib-basics.html)
    -  * data types 
    +* [Data Types](mllib-basics.html)
    --- End diff --
    
    fixed.




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1908#discussion_r16143683
  
    --- Diff: docs/mllib-guide.md ---
    @@ -3,18 +3,19 @@ layout: global
     title: Machine Learning Library (MLlib)
     ---
     
    -MLlib is a Spark implementation of some common machine learning algorithms and utilities,
    +MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities,
     including classification, regression, clustering, collaborative
    -filtering, dimensionality reduction, as well as underlying optimization primitives:
    +filtering, dimensionality reduction, as well as underlying optimization primitives, as outlined below:
     
    -* [Basics](mllib-basics.html)
    -  * data types 
    +* [Data Types](mllib-basics.html)
    +* [Statistics functionality](mllib-stats.html)
    --- End diff --
    
    `Statistics functionality` -> `Basic statistics`?




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1908#discussion_r16143694
  
    --- Diff: docs/mllib-linear-methods.md ---
    @@ -106,27 +105,25 @@ Here `$\mathrm{sign}(\wv)$` is the vector consisting of the signs (`$\pm1$`) of
     of `$\wv$`.
     
     L2-regularized problems are generally easier to solve than L1-regularized due to smoothness.
    -However, L1 regularization can help promote sparsity in weights, leading to simpler models, which is
    -also used for feature selection.  It is not recommended to train models without any regularization,
    +However, L1 regularization can help promote sparsity in weights leading to smaller and more interpretable models, the latter of which can be useful for feature selection.
    +It is not recommended to train models without any regularization,
     especially when the number of training examples is small.
     
     ## Binary classification
     
    -[Binary classification](http://en.wikipedia.org/wiki/Binary_classification) is to divide items into
    +[Binary classification](http://en.wikipedia.org/wiki/Binary_classification) aims to divide items into
     two categories: positive and negative.  MLlib supports two linear methods for binary classification:
    -linear support vector machine (SVM) and logistic regression.  The training data set is represented
    +linear support vector machines (SVMs) and logistic regression.  The training data set is represented
    --- End diff --
    
    How many linear SVMs do you count here? Does linear SVM with different regularization count as different SVMs?




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1908#issuecomment-51995438
  
    QA results for PR 1908:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18394/consoleFull




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1908#issuecomment-51992287
  
    QA results for PR 1908:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18390/consoleFull




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1908#discussion_r16147709
  
    --- Diff: docs/mllib-guide.md ---
    @@ -3,18 +3,19 @@ layout: global
     title: Machine Learning Library (MLlib)
     ---
     
    -MLlib is a Spark implementation of some common machine learning algorithms and utilities,
    +MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities,
     including classification, regression, clustering, collaborative
    -filtering, dimensionality reduction, as well as underlying optimization primitives:
    +filtering, dimensionality reduction, as well as underlying optimization primitives, as outlined below:
     
    -* [Basics](mllib-basics.html)
    -  * data types 
    +* [Data Types](mllib-basics.html)
    +* [Statistics functionality](mllib-stats.html)
    --- End diff --
    
    fixed.




[GitHub] spark pull request: SPARK-2830 [MLLIB]: re-organize mllib document...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on the pull request:

    https://github.com/apache/spark/pull/1908#issuecomment-51969508
  
    Thanks Xiangrui and Mark.  I just fixed this.
    
    
    On Tue, Aug 12, 2014 at 12:07 PM, Xiangrui Meng <no...@github.com>
    wrote:
    
    > @atalwalkar <https://github.com/atalwalkar> Could you add [MLLIB]
    > re-organize mllib documentation to the title?




[GitHub] spark pull request: SPARK-2830

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/1908#issuecomment-51962565
  
    nit: That's not really an adequate title for this PR, Ameet.  It should include enough description so that we can tell what it is about in the corresponding subject of the emailed notification without needing to consult JIRA. 




[GitHub] spark pull request: SPARK-2830

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1908#issuecomment-51961773
  
    @atalwalkar Could you add `[MLLIB] re-organize mllib documentation` to the title?




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1908#discussion_r16148869
  
    --- Diff: docs/mllib-linear-methods.md ---
    @@ -81,8 +80,8 @@ methods MLlib supports:
     ### Regularizers
     
     The purpose of the [regularizer](http://en.wikipedia.org/wiki/Regularization_(mathematics)) is to
    -encourage simple models, by punishing the complexity of the model `$\wv$`, in order to e.g. avoid
    -over-fitting.
    +encourage simple models and avoid
    +overfitting.
    --- End diff --
    
    fixed.




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1908#discussion_r16149793
  
    --- Diff: docs/mllib-linear-methods.md ---
    @@ -134,39 +131,42 @@ By default, linear SVMs are trained with an L2 regularization.
     We also support alternative L1 regularization. In this case,
     the problem becomes a [linear program](http://en.wikipedia.org/wiki/Linear_programming).
     
    -Linear SVM algorithm outputs a SVM model, which makes predictions based on the value of $\wv^T \x$.
    -By the default, if $\wv^T \x \geq 0$, the outcome is positive, or negative otherwise.
    -However, quite often in practice, the default threshold $0$ is not a good choice.
    -The threshold should be determined via model evaluation.
    +The linear SVMs algorithm outputs an SVMs model. Given a new data point, denoted by $\x$, the model makes predictions based on the value of $\wv^T \x$.
    --- End diff --
    
    fixed.
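
For readers following the diff above, the decision rule it describes can be sketched in a few lines of plain Python. This is only an illustration of thresholding $\wv^T \x$, not MLlib code: the names `svm_margin` and `svm_predict` are hypothetical, and returning a boolean for the positive class is an assumption made here for brevity.

```python
from typing import Sequence


def svm_margin(weights: Sequence[float], features: Sequence[float]) -> float:
    """Return the raw score w^T x for one data point."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features))


def svm_predict(
    weights: Sequence[float],
    features: Sequence[float],
    threshold: float = 0.0,
) -> bool:
    """Predict the positive class when w^T x >= threshold.

    The default threshold of 0 matches the rule quoted in the diff;
    a different value can be chosen after model evaluation.
    """
    return svm_margin(weights, features) >= threshold
```

With weights `[0.5, -1.0]` and features `[2.0, 0.5]` the score is 0.5, so the point is positive at the default threshold and negative if the threshold is raised to 1.0.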




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1908#discussion_r16149173
  
    --- Diff: docs/mllib-linear-methods.md ---
    @@ -106,27 +105,25 @@ Here `$\mathrm{sign}(\wv)$` is the vector consisting of the signs (`$\pm1$`) of
     of `$\wv$`.
     
     L2-regularized problems are generally easier to solve than L1-regularized due to smoothness.
    -However, L1 regularization can help promote sparsity in weights, leading to simpler models, which is
    -also used for feature selection.  It is not recommended to train models without any regularization,
    +However, L1 regularization can help promote sparsity in weights leading to smaller and more interpretable models, the latter of which can be useful for feature selection.
    +It is not recommended to train models without any regularization,
     especially when the number of training examples is small.
     
     ## Binary classification
     
    -[Binary classification](http://en.wikipedia.org/wiki/Binary_classification) is to divide items into
    +[Binary classification](http://en.wikipedia.org/wiki/Binary_classification) aims to divide items into
     two categories: positive and negative.  MLlib supports two linear methods for binary classification:
    -linear support vector machine (SVM) and logistic regression.  The training data set is represented
    +linear support vector machines (SVMs) and logistic regression.  The training data set is represented
    --- End diff --
    
    It does seem odd to talk about linear regression, lasso, and ridge regression as distinct algorithms, but to talk about linear SVMs and logistic regression as single algorithms without mentioning the different regularizers.  That said, this is often done in practice.  We could add a sentence to clarify this point, e.g., "...: linear support vector machines (SVMs) and logistic regression.  We support both L1 and L2 regularization for these methods."
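
To make the L1 versus L2 distinction in this thread concrete, here is a minimal sketch of the two penalties and their (sub)gradients in plain Python. It assumes the usual forms, $\frac{1}{2}\|\wv\|_2^2$ for L2 and $\|\wv\|_1$ for L1; the function names are hypothetical and are not MLlib APIs, and using 0 as the sub-gradient at a zero weight is a choice made for this sketch.

```python
from typing import List, Sequence


def l2_penalty(w: Sequence[float]) -> float:
    """L2 regularizer: one half of the squared Euclidean norm."""
    return 0.5 * sum(x * x for x in w)


def l2_gradient(w: Sequence[float]) -> List[float]:
    """Gradient of the L2 regularizer, which is w itself (smooth everywhere)."""
    return list(w)


def l1_penalty(w: Sequence[float]) -> float:
    """L1 regularizer: sum of absolute values of the weights."""
    return sum(abs(x) for x in w)


def l1_subgradient(w: Sequence[float]) -> List[float]:
    """A sub-gradient of the L1 regularizer: the sign of each weight.

    At a zero weight any value in [-1, 1] is valid; 0 is used here.
    """
    return [float((x > 0) - (x < 0)) for x in w]
```

The L1 sub-gradient has constant magnitude regardless of how small a weight is, which is one way to see why L1 tends to push weights to exactly zero while L2 only shrinks them.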




[GitHub] spark pull request: SPARK-2830 [MLlib]: re-organize mllib document...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1908#discussion_r16143691
  
    --- Diff: docs/mllib-linear-methods.md ---
    @@ -81,8 +80,8 @@ methods MLlib supports:
     ### Regularizers
     
     The purpose of the [regularizer](http://en.wikipedia.org/wiki/Regularization_(mathematics)) is to
    -encourage simple models, by punishing the complexity of the model `$\wv$`, in order to e.g. avoid
    -over-fitting.
    +encourage simple models and avoid
    +overfitting.
    --- End diff --
    
    merge into previous line?

