Posted to reviews@spark.apache.org by dbtsai <gi...@git.apache.org> on 2014/08/21 00:23:30 UTC

[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

GitHub user dbtsai opened a pull request:

    https://github.com/apache/spark/pull/2068

    [SPARK-2841][MLlib] Documentation for feature transformations

    Documentation for newly added feature transformations:
    1. TF-IDF
    2. StandardScaler
    3. Normalizer

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/AlpineNow/spark transformer-documentation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2068.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2068
    
----
commit e339f64fbc35ad97a1ba021a6bf03bb6d0e06f31
Author: DB Tsai <db...@alpinenow.com>
Date:   2014-08-20T22:21:26Z

    documentation

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2068#issuecomment-52858796
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19004/consoleFull) for   PR 2068 at commit [`e339f64`](https://github.com/apache/spark/commit/e339f64fbc35ad97a1ba021a6bf03bb6d0e06f31).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  shift # Ignore main class (org.apache.spark.deploy.SparkSubmit) and use our own`





[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2068#issuecomment-53138489
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19088/consoleFull) for   PR 2068 at commit [`109f324`](https://github.com/apache/spark/commit/109f32403a7395002a4eab9da46841d88f62d7cc).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2068#issuecomment-53141048
  
    **Tests timed out** after a configured wait of `120m`.




[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2068#discussion_r16581683
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -70,4 +70,110 @@ for((synonym, cosineSimilarity) <- synonyms) {
     </div>
     </div>
     
    -## TFIDF
    \ No newline at end of file
    +## TFIDF
    +
    +## StandardScaler
    +
    +Standardizes features by scaling to unit variance and/or removing the mean using column summary
    +statistics on the samples in the training set. For example, RBF kernel of Support Vector Machines
    +or the L1 and L2 regularized linear models typically assume that all features have unit variance
    +and/or zero mean.
    --- End diff --
    
    Your suggestion sounds good to me!  Thanks.




[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2068#discussion_r16514387
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -70,4 +70,110 @@ for((synonym, cosineSimilarity) <- synonyms) {
     </div>
     </div>
     
    -## TFIDF
    \ No newline at end of file
    +## TFIDF
    +
    +## StandardScaler
    +
    +Standardizes features by scaling to unit variance and/or removing the mean using column summary
    +statistics on the samples in the training set. For example, RBF kernel of Support Vector Machines
    +or the L1 and L2 regularized linear models typically assume that all features have unit variance
    +and/or zero mean.
    +
    +Standardization can not only improve the convergence rate during the optimization process, but also
    +avoid the problem that when training linear models with regularization against a feature having
    +a variance that is orders of magnitude larger than others, it might dominate the objective function
    +and make the estimator unable to learn from other features correctly as expected.
    +
    +### Model Fitting
    +
    +[`StandardScaler`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler) has the
    +following parameters in the constructor,
    --- End diff --
    
    "," -> ":"




[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2068#issuecomment-53217627
  
    LGTM. Merged into master and branch-1.1! Thanks for helping on the documentation!!




[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2068#discussion_r16561045
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -70,4 +70,110 @@ for((synonym, cosineSimilarity) <- synonyms) {
     </div>
     </div>
     
    -## TFIDF
    \ No newline at end of file
    +## TFIDF
    +
    +## StandardScaler
    +
    +Standardizes features by scaling to unit variance and/or removing the mean using column summary
    +statistics on the samples in the training set. For example, RBF kernel of Support Vector Machines
    +or the L1 and L2 regularized linear models typically assume that all features have unit variance
    +and/or zero mean.
    --- End diff --
    
    How about I say:
    "For example, RBF kernel of Support Vector Machines
    or the L1 and L2 regularized linear models typically work better when all features have unit variance
    and/or zero mean."
    
    I actually took this statement from the scikit-learn documentation:
    http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
    
    





[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2068#discussion_r16514371
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -70,4 +70,110 @@ for((synonym, cosineSimilarity) <- synonyms) {
     </div>
     </div>
     
    -## TFIDF
    \ No newline at end of file
    +## TFIDF
    +
    +## StandardScaler
    +
    +Standardizes features by scaling to unit variance and/or removing the mean using column summary
    +statistics on the samples in the training set. For example, RBF kernel of Support Vector Machines
    +or the L1 and L2 regularized linear models typically assume that all features have unit variance
    +and/or zero mean.
    +
    +Standardization can not only improve the convergence rate during the optimization process, but also
    +avoid the problem that when training linear models with regularization against a feature having
    +a variance that is orders of magnitude larger than others, it might dominate the objective function
    +and make the estimator unable to learn from other features correctly as expected.
    --- End diff --
    
    Suggested edit: "Standardization can improve the convergence rate during the optimization process, and also prevents features with very large variances from exerting an overly large influence during model training."
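
To make the point about large variances concrete, here is a minimal sketch in plain Python. It is not the MLlib `StandardScaler` API: the function names `column_stats` and `standardize`, the `with_mean`/`with_std` flags, the use of the sample (n - 1) standard deviation, and the handling of zero-variance columns are all assumptions made for illustration.

```python
import math


def column_stats(rows):
    """Return per-column means and sample standard deviations (n - 1 denominator)."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(dims)
    ]
    return means, stds


def standardize(rows, with_mean=True, with_std=True):
    """Optionally remove the column mean and/or scale each column to unit variance."""
    means, stds = column_stats(rows)
    out = []
    for r in rows:
        scaled = []
        for j, x in enumerate(r):
            if with_mean:
                x -= means[j]
            # Assumption: a zero-variance column is left unscaled to avoid dividing by zero.
            if with_std and stds[j] != 0.0:
                x /= stds[j]
            scaled.append(x)
        out.append(scaled)
    return out


# The second feature is three orders of magnitude larger than the first.
rows = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
standardized = standardize(rows)
print(standardized)
```

After standardization both columns span the same range, so neither one outweighs the other in a regularized objective merely because of its units.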




[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2068#issuecomment-52853909
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19004/consoleFull) for   PR 2068 at commit [`e339f64`](https://github.com/apache/spark/commit/e339f64fbc35ad97a1ba021a6bf03bb6d0e06f31).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2068




[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2068#issuecomment-52981186
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19065/consoleFull) for   PR 2068 at commit [`0a8fd34`](https://github.com/apache/spark/commit/0a8fd34dcfb45be4e0cbae0078ff7bd5b97814bc).
     * This patch **fails** unit tests.
     * This patch **does not** merge cleanly!





[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2068#issuecomment-52858975
  
    copy @atalwalkar




[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/2068#issuecomment-53138329
  
    @atalwalkar and @mengxr I just addressed the merge conflict. I think it's ready to merge. Thanks.




[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2068#issuecomment-52970122
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19065/consoleFull) for   PR 2068 at commit [`0a8fd34`](https://github.com/apache/spark/commit/0a8fd34dcfb45be4e0cbae0078ff7bd5b97814bc).
     * This patch **does not** merge cleanly!




[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2068#discussion_r16514111
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -70,4 +70,110 @@ for((synonym, cosineSimilarity) <- synonyms) {
     </div>
     </div>
     
    -## TFIDF
    \ No newline at end of file
    +## TFIDF
    +
    +## StandardScaler
    +
    +Standardizes features by scaling to unit variance and/or removing the mean using column summary
    +statistics on the samples in the training set. For example, RBF kernel of Support Vector Machines
    +or the L1 and L2 regularized linear models typically assume that all features have unit variance
    +and/or zero mean.
    --- End diff --
    
    This is too strong a statement. Why not just say "Normalizing features to have unit variance and/or zero mean is a very common preprocessing step"?

