Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2015/09/14 20:53:52 UTC

[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/8752

    [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML guide cleanups

    Various ML guide cleanups.
    
    * ml-guide.md: Make it easier to access the algorithm-specific guides.
    * LDA user guide: EM often begins with useless topics, but running longer generally improves them dramatically.  E.g., 10 iterations on a Wikipedia dataset produces useless topics, but 50 iterations produces very meaningful topics.
    * mllib-feature-extraction.html#elementwiseproduct: “w” parameter should be “scalingVec”
    * Clean up Binarizer user guide a little.
    * Document in Pipeline that users should not put an instance into the Pipeline in more than 1 place.
    * spark.ml Word2Vec user guide: clean up grammar/writing
    * Chi Sq Feature Selector docs: Improve text in doc.
    
    CC: @mengxr @feynmanliang 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark mlguide-fixes-1.5

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8752.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8752
    
----
commit 53d757a74f156893f2fafc5c65624acfb2920ffa
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2015-09-14T18:50:51Z

    ml-guide.md: Make it easier to access the algorithm-specific guides.
    
    LDA user guide: EM often begins with useless topics, but running longer generally improves them dramatically.  E.g., 10 iterations on a Wikipedia dataset produces useless topics, but 50 iterations produces very meaningful topics.
    
    mllib-feature-extraction.html#elementwiseproduct
    * “w” parameter should be “scalingVec”
    
    Clean up Binarizer user guide a little.
    
    Document in Pipeline that users should not put an instance into the Pipeline in more than 1 place.
    
    spark.ml Word2Vec user guide:
    * clean up grammar/writing
    
    Chi Sq Feature Selector docs
    * Improve text in doc.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8752#discussion_r39461756
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -486,7 +492,8 @@ sc.stop();
     
     ## ElementwiseProduct
     
    -ElementwiseProduct multiplies each input vector by a provided "weight" vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier.  This represents the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29) between the input vector, `v` and transforming vector, `w`, to yield a result vector.
    +ElementwiseProduct multiplies each input vector by a provided "weight" vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier.  This represents the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29) between the input vector, `v` and transforming vector, `scalingVec`, to yield a result vector.
    --- End diff --
    
    100cw
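
As a rough illustration of the element-wise scaling described in the diff above, the following plain-Python sketch multiplies each input vector `v` by a fixed `scaling_vec`. It does not use Spark; the function name `elementwise_product` and the sample data are assumptions made up for this example, not part of the MLlib API.

```python
from typing import List, Sequence


def elementwise_product(v: Sequence[float], scaling_vec: Sequence[float]) -> List[float]:
    """Return the Hadamard (element-wise) product of v and scaling_vec."""
    if len(v) != len(scaling_vec):
        raise ValueError("v and scaling_vec must have the same length")
    return [x * s for x, s in zip(v, scaling_vec)]


# Hypothetical dataset: each row is one input vector.
rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
scaling_vec = [0.0, 1.0, 2.0]

# Every row is scaled by the same vector, so column j is multiplied by scaling_vec[j].
scaled = [elementwise_product(row, scaling_vec) for row in rows]
print(scaled)  # [[0.0, 2.0, 6.0], [0.0, 5.0, 12.0]]
```

Because the same `scaling_vec` is applied to every row, each column of the dataset ends up multiplied by a single scalar, which is the behavior the guide text describes.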




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8752#discussion_r39555841
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -380,35 +380,37 @@ data2 = labels.zip(normalizer2.transform(features))
     </div>
     </div>
     
    -## Feature selection
    -[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) allows selecting the most relevant features for use in model construction. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors. The number of features to select can be tuned using a held-out validation set.
    +## ChiSqSelector
     
    -### ChiSqSelector
    -[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) stands for Chi-Squared feature selection. It operates on labeled data with categorical features. `ChiSqSelector` orders features based on a Chi-Squared test of independence from the class, and then filters (selects) the top features which the class label depends on the most. This is akin to yielding the features with the most predictive power.
    +[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) tries to identify relevant features for use in model construction. It reduces the size of the feature space, which can improve both speed and statistical learning behavior.
     
    -#### Model Fitting
    +[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements Chi-Squared feature selection. It operates on labeled data with categorical features. `ChiSqSelector` orders features based on a Chi-Squared test of independence from the class, and then filters (selects) the top features which the class label depends on the most. This is akin to yielding the features with the most predictive power.
    --- End diff --
    
    The 100-character line width isn't actually required for docs.  I prefer it, but I don't want to modify stuff unnecessarily if it complicates the GitHub diff.  I'll check through for cases where I can fix it w/o affecting the diff.
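
For readers unfamiliar with the selection step described in the quoted `ChiSqSelector` text, here is a minimal plain-Python sketch of the idea: score each categorical feature with a Pearson chi-squared test of independence against the class label, then keep the highest-scoring features. This is not Spark's implementation; the helper names (`chi_squared`, `select_top_features`) and the toy data are assumptions for illustration only.

```python
from collections import Counter
from typing import List, Sequence


def chi_squared(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Pearson chi-squared statistic for independence of one feature and the label."""
    n = len(labels)
    observed = Counter(zip(feature, labels))   # counts per (feature value, label) cell
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for c, c_count in label_totals.items():
            expected = f_count * c_count / n   # expected count under independence
            stat += (observed[(f, c)] - expected) ** 2 / expected
    return stat


def select_top_features(rows: Sequence[Sequence[int]],
                        labels: Sequence[int],
                        num_top_features: int) -> List[int]:
    """Return the indices of the features with the largest chi-squared statistics."""
    num_features = len(rows[0])
    scores = [chi_squared([row[j] for row in rows], labels) for j in range(num_features)]
    ranked = sorted(range(num_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:num_top_features])   # report selected indices in index order


# Toy data: feature 0 determines the label, feature 1 is independent of it.
rows = [[0, 0, 1], [0, 1, 1], [1, 0, 0], [1, 1, 1]]
labels = [0, 0, 1, 1]
print(select_top_features(rows, labels, 2))  # [0, 2]
```

A larger statistic means stronger evidence that the label depends on the feature, which is why the top-ranked features are the ones kept.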




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8752#discussion_r39461620
  
    --- Diff: docs/mllib-clustering.md ---
    @@ -507,6 +507,10 @@ must also be $> 1.0$. Providing `Vector(-1)` results in default behavior
     $> 1.0$. Providing `-1` results in defaulting to a value of $0.1 + 1$.
     * `maxIterations`: The maximum number of EM iterations.
     
    +*Hint*: It is important to do enough iterations.  In early iterations, EM often has useless topics,
    --- End diff --
    
    nit: I prefer "Note" instead of "Hint" since I haven't seen "Hint" anywhere else in the user guides, but feel free to leave it if you like it better.




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8752#issuecomment-140181255
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42436/




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8752#issuecomment-140181254
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8752#issuecomment-140175794
  
    Merged build started.




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8752#issuecomment-140177522
  
      [Test build #42436 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42436/consoleFull) for   PR 8752 at commit [`53d757a`](https://github.com/apache/spark/commit/53d757a74f156893f2fafc5c65624acfb2920ffa).




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8752#discussion_r39461708
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -380,35 +380,37 @@ data2 = labels.zip(normalizer2.transform(features))
     </div>
     </div>
     
    -## Feature selection
    -[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) allows selecting the most relevant features for use in model construction. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors. The number of features to select can be tuned using a held-out validation set.
    +## ChiSqSelector
     
    -### ChiSqSelector
    -[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) stands for Chi-Squared feature selection. It operates on labeled data with categorical features. `ChiSqSelector` orders features based on a Chi-Squared test of independence from the class, and then filters (selects) the top features which the class label depends on the most. This is akin to yielding the features with the most predictive power.
    +[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) tries to identify relevant features for use in model construction. It reduces the size of the feature space, which can improve both speed and statistical learning behavior.
     
    -#### Model Fitting
    +[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements Chi-Squared feature selection. It operates on labeled data with categorical features. `ChiSqSelector` orders features based on a Chi-Squared test of independence from the class, and then filters (selects) the top features which the class label depends on the most. This is akin to yielding the features with the most predictive power.
    --- End diff --
    
    100cw?




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8752#issuecomment-140520578
  
    Merged build started.




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8752#discussion_r39559676
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -506,7 +523,7 @@ v_N
     
     [`ElementwiseProduct`](api/scala/index.html#org.apache.spark.mllib.feature.ElementwiseProduct) has the following parameter in the constructor:
    --- End diff --
    
    Not part of your PR, but this API doc link would be better if it were inside the codetabs




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/8752




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8752#issuecomment-140520533
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8752#issuecomment-140175749
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8752#discussion_r39559560
  
    --- Diff: docs/ml-features.md ---
    @@ -123,12 +123,21 @@ for features_label in rescaledData.select("features", "label").take(3):
     
     ## Word2Vec
     
    -`Word2Vec` is an `Estimator` which takes sequences of words that represents documents and trains a `Word2VecModel`. The model is a `Map(String, Vector)` essentially, which maps each word to an unique fix-sized vector. The `Word2VecModel` transforms each documents into a vector using the average of all words in the document, which aims to other computations of documents such as similarity calculation consequencely. Please refer to the [MLlib user guide on Word2Vec](mllib-feature-extraction.html#Word2Vec) for more details on Word2Vec.
    +`Word2Vec` is an `Estimator` which takes sequences of words representing documents and trains a
    +`Word2VecModel`. The model maps each word to a unique fixed-size vector. The `Word2VecModel`
    +transforms each document into a vector using the average of all words in the document; this vector
    +can then be used as features for prediction, document similarity calculations, etc.
    +Please refer to the [MLlib user guide on Word2Vec](mllib-feature-extraction.html#Word2Vec) for more
    +details.
     
    -Word2Vec is implemented in [Word2Vec](api/scala/index.html#org.apache.spark.ml.feature.Word2Vec). In the following code segment, we start with a set of documents, each of them is represented as a sequence of words. For each document, we transform it into a feature vector. This feature vector could then be passed to a learning algorithm.
    +In the following code segment, we start with a set of documents, each of which is represented as a sequence of words. For each document, we transform it into a feature vector. This feature vector could then be passed to a learning algorithm.
     
     <div class="codetabs">
     <div data-lang="scala" markdown="1">
    +
    +Refer to the [Word2Vec Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Word2Vec)
    --- End diff --
    
    The class name is `backticked` in ChiSqSelector but not here or in Binarizer; we should choose one and be consistent. I would vote for backticking everything since that's what I've been doing.
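
To make the averaging step in the quoted Word2Vec text concrete, the sketch below turns a document (a sequence of words) into a single vector by averaging per-word vectors. It is plain Python, not the `spark.ml` API: the `document_vector` helper and the tiny `word_vectors` table are hypothetical, and skipping out-of-vocabulary words before averaging is an assumption of this sketch rather than a statement about how `Word2VecModel` handles them.

```python
from typing import Dict, List, Sequence


def document_vector(words: Sequence[str],
                    word_vectors: Dict[str, List[float]],
                    size: int) -> List[float]:
    """Average the vectors of the known words in a document."""
    # Assumption of this sketch: words missing from the table are ignored.
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return [0.0] * size  # no known words: fall back to the zero vector
    return [sum(vec[i] for vec in known) / len(known) for i in range(size)]


# Hypothetical 2-dimensional word vectors, standing in for a trained model.
word_vectors = {
    "spark": [1.0, 0.0],
    "ml": [0.0, 1.0],
    "guide": [1.0, 1.0],
}

print(document_vector(["spark", "ml"], word_vectors, 2))  # [0.5, 0.5]
```

The resulting fixed-size vector is what could then be passed on as features for prediction or compared across documents for similarity.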




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8752#discussion_r39461489
  
    --- Diff: docs/ml-guide.md ---
    @@ -32,7 +32,18 @@ See the [algorithm guides](#algorithm-guides) section below for guides on sub-pa
     * This will become a table of contents (this text will be scraped).
     {:toc}
     
    -# Main concepts
    +# Algorithm guides
    +
    +We provide several algorithm guides specific to the Pipelines API.
    +Several of these algorithms, such as certain feature transformers, are not in the `spark.mllib` API.
    +
    +* [Feature extraction, transformation, and selection](ml-features.html)
    +* [Decision Trees for classification and regression](ml-decision-tree.html)
    +* [Ensembles](ml-ensembles.html)
    +* [Linear methods with elastic net regularization](ml-linear-methods.html)
    --- End diff --
    
    Not sure if we should include model summaries in this description; I had a mailing list question about where that feature is documented




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8752#issuecomment-140578097
  
      [Test build #42500 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42500/console) for   PR 8752 at commit [`91f4edd`](https://github.com/apache/spark/commit/91f4eddd16a28230b8a241609d614501e34a393f).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `and then filters (selects) the top features which the class label depends on the most.`





[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8752#issuecomment-140181119
  
      [Test build #42436 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42436/console) for   PR 8752 at commit [`53d757a`](https://github.com/apache/spark/commit/53d757a74f156893f2fafc5c65624acfb2920ffa).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements Chi-Squared feature selection. It operates on labeled data with categorical features. `ChiSqSelector` orders features based on a Chi-Squared test of independence from the class, and then filters (selects) the top features which the class label depends on the most. This is akin to yielding the features with the most predictive power.`





[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on the pull request:

    https://github.com/apache/spark/pull/8752#issuecomment-140526380
  
    LGTM after changes




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/8752#issuecomment-140519878
  
    @feynmanliang Thanks for reviewing.  Just updated per your comments.




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/8752#issuecomment-140608613
  
    Merged into master. Thanks!




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8752#issuecomment-140522423
  
      [Test build #42500 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42500/consoleFull) for   PR 8752 at commit [`91f4edd`](https://github.com/apache/spark/commit/91f4eddd16a28230b8a241609d614501e34a393f).




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8752#discussion_r39461788
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -486,7 +492,8 @@ sc.stop();
     
     ## ElementwiseProduct
     
    -ElementwiseProduct multiplies each input vector by a provided "weight" vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier.  This represents the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29) between the input vector, `v` and transforming vector, `w`, to yield a result vector.
    +ElementwiseProduct multiplies each input vector by a provided "weight" vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier.  This represents the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29) between the input vector, `v` and transforming vector, `scalingVec`, to yield a result vector.
    --- End diff --
    
    "`ElementwiseProduct`"




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8752#discussion_r39461748
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -380,35 +380,37 @@ data2 = labels.zip(normalizer2.transform(features))
     </div>
     </div>
     
    -## Feature selection
    -[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) allows selecting the most relevant features for use in model construction. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors. The number of features to select can be tuned using a held-out validation set.
    +## ChiSqSelector
     
    -### ChiSqSelector
    -[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) stands for Chi-Squared feature selection. It operates on labeled data with categorical features. `ChiSqSelector` orders features based on a Chi-Squared test of independence from the class, and then filters (selects) the top features which the class label depends on the most. This is akin to yielding the features with the most predictive power.
    +[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) tries to identify relevant features for use in model construction. It reduces the size of the feature space, which can improve both speed and statistical learning behavior.
    --- End diff --
    
    100cw
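
As an aside on the removed text in the diff above, the Chi-Squared ranking it describes (order features by a test of independence from the class, then keep the top ones) can be sketched in plain Python. This is only an illustration under stated assumptions: it ranks by the raw chi-squared statistic, it is not the `ChiSqSelector` implementation, and the names `chi_squared`, `select_top_features`, and `num_top_features` are hypothetical.

```python
from collections import Counter


def chi_squared(feature_values, labels):
    """Pearson chi-squared statistic for one categorical feature vs. the class label."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for c, c_count in label_totals.items():
            # Expected count if the feature and the label were independent.
            expected = f_count * c_count / n
            stat += (observed[(f, c)] - expected) ** 2 / expected
    return stat


def select_top_features(rows, labels, num_top_features):
    """Return the indices of the features with the largest chi-squared statistic."""
    num_features = len(rows[0])
    scores = [chi_squared([row[j] for row in rows], labels)
              for j in range(num_features)]
    ranked = sorted(range(num_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:num_top_features])


# Hypothetical toy data: feature 0 matches the label exactly,
# while features 1 and 2 are independent of it.
rows = [[0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1]]
labels = [0, 0, 1, 1]
selected = select_top_features(rows, labels, 1)
print(selected)  # [0]
```

A larger statistic means stronger evidence that the feature and the label are dependent, which is why the top-ranked features are the ones kept.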




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8752#discussion_r39555824
  
    --- Diff: docs/ml-guide.md ---
    @@ -32,7 +32,18 @@ See the [algorithm guides](#algorithm-guides) section below for guides on sub-pa
     * This will become a table of contents (this text will be scraped).
     {:toc}
     
    -# Main concepts
    +# Algorithm guides
    +
    +We provide several algorithm guides specific to the Pipelines API.
    +Several of these algorithms, such as certain feature transformers, are not in the `spark.mllib` API.
    +
    +* [Feature extraction, transformation, and selection](ml-features.html)
    +* [Decision Trees for classification and regression](ml-decision-tree.html)
    +* [Ensembles](ml-ensembles.html)
    +* [Linear methods with elastic net regularization](ml-linear-methods.html)
    --- End diff --
    
    Yeah, there's not a great place.  I'll try sticking a note here.




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8752#issuecomment-140578203
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML g...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8752#issuecomment-140578205
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42500/

