
Posted to reviews@spark.apache.org by avulanov <gi...@git.apache.org> on 2015/02/21 00:44:17 UTC

[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

GitHub user avulanov opened a pull request:

    https://github.com/apache/spark/pull/4709

    [MLLIB] SPARK-5912 Programming guide for feature selection

    Added a description of ChiSqSelector and a few words about feature selection in general. I could add a code example; however, it would not look reasonable in the absence of a feature discretizer or a dataset in the `data` folder that has redundant features.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/avulanov/spark SPARK-5912

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4709.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4709
    
----
commit c845350afd91ec5e5e329989fc770da23d0c459d
Author: Alexander Ulanov <na...@yandex.ru>
Date:   2015-02-20T23:36:52Z

    ChiSqSelector docs

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75610561
  
    Sorry for this, still sleeping...



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75340531
  
      [Test build #27796 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27796/consoleFull) for   PR 4709 at commit [`c845350`](https://github.com/apache/spark/commit/c845350afd91ec5e5e329989fc770da23d0c459d).
     * This patch merges cleanly.



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75346567
  
      [Test build #27796 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27796/consoleFull) for   PR 4709 at commit [`c845350`](https://github.com/apache/spark/commit/c845350afd91ec5e5e329989fc770da23d0c459d).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75626645
  
      [Test build #27860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27860/consoleFull) for   PR 4709 at commit [`19a8a4e`](https://github.com/apache/spark/commit/19a8a4e9b8c3b5607c87fb1eae19810f90b9ad6a).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `ChiSqSelector stands for Chi-Squared feature selection. It operates on the labeled data. ChiSqSelector orders categorical features based on their values of Chi-Squared test on independence from class and filters (selects) top given features.  `




[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75621857
  
    Merged into master and branch-1.3



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75462021
  
    The generated doc seems OK except for the comments above.



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75626660
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27860/



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4709



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4709#discussion_r25136231
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -375,3 +375,52 @@ data2 = labels.zip(normalizer2.transform(features))
     {% endhighlight %}
     </div>
     </div>
    +
    +## Feature selection
    +(Feature selection)[http://en.wikipedia.org/wiki/Feature_selection] allows selecting the most relevant features for use in model construction. The number of features to select can be determined using the validation set. Feature selection is usually applied on sparse data, for example in text classification. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors. 
    +
    +### ChiSqSelector
    +ChiSqSelector stands for Chi-Squared feature selection. It operates on the labeled data. ChiSqSelector orders categorical features based on their values of Chi-Squared test on independence from class and filters (selects) top given features.  
    +
    +#### Model Fitting
    +
    +[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) has the
    +following parameters in the constructor:
    +
    +* `numTopFeatures` number of top features that selector will select (filter).
    +
    +We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method in
    +`ChiSqSelector` which can take an input of `RDD[LabeledPoint]` with categorical features, learn the summary statistics, and then
    +return a model which can transform the input dataset into the reduced feature space.
    +
    +This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
    +which can apply the Chi-Squared feature selection on a `Vector` to produce a reduced `Vector` or on
    +an `RDD[Vector]` to produce a reduced `RDD[Vector]`.
    +
    +Note that the model that performs actual feature filtering can be instantiated independently with array of feature indices that has to be sorted ascending.
    +
    +#### Example
    +
    +The following example shows the basic use of ChiSqSelector.
    +
    +<div class="codetabs">
    +<div data-lang="scala">
    +{% highlight scala %}
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.util.MLUtils
    +
    +// load some data in libsvm format, each point is in the range 0..255
    +val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    +// discretize data in 16 equal bins
    +val discretizedData = data.map { lp =>
    +  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 } ) )
    +}
    +// create ChiSqSelector that will select 50 features
    +val selector = new ChiSqSelector(50)
    +// filter top 50 features
    +val filteredData = selector.fit(disctetizedData)
    --- End diff --
    
    typo here too: disctetizedData
    
    Also, "selector.fit" really returns a model, not the data.  Would you mind changing filteredData to be labeled as a model and then using the model to do something (like print the selected feature indices)?
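
A minimal sketch of what this comment asks for, assuming the Spark 1.3 MLlib API, where `ChiSqSelector.fit` returns a `ChiSqSelectorModel` that exposes a `selectedFeatures` array. The object name, the helper names, the local master setting, and the hand-built rows are illustrative assumptions and not part of the PR. `math.floor` is added because `x / 16` on a `Double` does not by itself produce integer bin indices.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{ChiSqSelector, ChiSqSelectorModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical object name; not part of the PR.
object ChiSqFitSketch {
  // Map each raw value in 0..255 to one of 16 bin indices (0.0 to 15.0).
  def discretize(lp: LabeledPoint): LabeledPoint =
    LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map(x => math.floor(x / 16))))

  // fit returns a model, not filtered data, so the result is named accordingly.
  def fitModel(data: RDD[LabeledPoint], numTopFeatures: Int): ChiSqSelectorModel =
    new ChiSqSelector(numTopFeatures).fit(data)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ChiSqFitSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Hand-constructed rows standing in for a real dataset (assumption).
    val raw = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(10.0, 200.0, 33.0)),
      LabeledPoint(0.0, Vectors.dense(12.0, 40.0, 250.0)),
      LabeledPoint(1.0, Vectors.dense(130.0, 210.0, 35.0)),
      LabeledPoint(1.0, Vectors.dense(140.0, 45.0, 120.0))
    ))
    val model = fitModel(raw.map(discretize), 2)
    // Print the indices of the features the model keeps.
    println(model.selectedFeatures.mkString(", "))
    sc.stop()
  }
}
```

Which indices are printed depends on how the Spark version in use ranks the chi-squared test results, so no expected output is shown.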



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75614718
  
      [Test build #27857 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27857/consoleFull) for   PR 4709 at commit [`58d9e4d`](https://github.com/apache/spark/commit/58d9e4d0dd4c03399cafd487f6391b1c560e82d8).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `ChiSqSelector stands for Chi-Squared feature selection. It operates on the labeled data. ChiSqSelector orders categorical features based on their values of Chi-Squared test on independence from class and filters (selects) top given features.  `




[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75603592
  
    I think that last issue is the only one--thanks!



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4709#discussion_r25136229
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -375,3 +375,52 @@ data2 = labels.zip(normalizer2.transform(features))
     {% endhighlight %}
     </div>
     </div>
    +
    +## Feature selection
    +(Feature selection)[http://en.wikipedia.org/wiki/Feature_selection] allows selecting the most relevant features for use in model construction. The number of features to select can be determined using the validation set. Feature selection is usually applied on sparse data, for example in text classification. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors. 
    --- End diff --
    
    Syntax for links: ```[link text](actual link)```




[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75638966
  
    Oops, did not realize that a test was still running (glad it passed)



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75599168
  
      [Test build #27857 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27857/consoleFull) for   PR 4709 at commit [`58d9e4d`](https://github.com/apache/spark/commit/58d9e4d0dd4c03399cafd487f6391b1c560e82d8).
     * This patch merges cleanly.



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4709#discussion_r25113939
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -375,3 +375,28 @@ data2 = labels.zip(normalizer2.transform(features))
     {% endhighlight %}
     </div>
     </div>
    +
    +## Feature selection
    +Feature selection allows selecting relevant features for use in model construction leaving out the redundant ones. The number of features to select can be determined using the validation set. Feature selection is usually applied on sparse data, for example in text classification. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors. 
    +
    +### ChiSqSelector
    +ChiSqSelector stands for Chi-Squared feature selection. It operates on the labeled data. ChiSqSelector orders categorical features based on their values of Chi-Squared test on independence from class and filters (selects) top given features.  
    +
    +#### Model Fitting
    +
    +[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) has the
    +following parameters in the constructor:
    +
    +* `numTopFeatures` number of top features that selector will select (filter).
    +
    +We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method in
    +`ChiSqSelector` which can take an input of `RDD[LabeledPoint]` with categorical features, learn the summary statistics, and then
    +return a model which can transform the input dataset into the reduced feature space.
    +
    +This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
    +which can apply the Chi-Squared feature selection on a `Vector` to produce a reduced `Vector` or on
    +an `RDD[Vector]` to produce a reduced `RDD[Vector]`.
    +
    +Note that the model that performs actual feature filtering can be instantiated independently with array of feature indices that has to be sorted ascending.
    +</div>
    --- End diff --
    
    Extraneous div tags
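
The quoted diff also notes that the filtering model can be instantiated directly from an ascending array of feature indices, without calling `fit`. A minimal sketch of that, assuming the public `ChiSqSelectorModel(selectedFeatures: Array[Int])` constructor from Spark 1.3. The object name, the helper `reduce`, and the indices `0` and `2` are illustrative only.

```scala
import org.apache.spark.mllib.feature.ChiSqSelectorModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical object name; not part of the PR.
object ManualSelectorSketch {
  // Keep features 0 and 2. The index array is given in ascending order,
  // as the documentation text in the diff requires.
  val model = new ChiSqSelectorModel(Array(0, 2))

  // Apply the selection to a single vector; no SparkContext is needed for this.
  def reduce(v: Vector): Vector = model.transform(v)

  def main(args: Array[String]): Unit = {
    // The three-element input is reduced to the values at indices 0 and 2.
    println(reduce(Vectors.dense(1.0, 2.0, 3.0)))
  }
}
```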



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75456675
  
    @avulanov Thanks for the updates!  Except for those 2 issues, I think this should be ready to go.  (I'm testing doc compilation now.)



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75348983
  
      [Test build #27799 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27799/consoleFull) for   PR 4709 at commit [`eb6b9fe`](https://github.com/apache/spark/commit/eb6b9fe61126f3b75d4741bc2a978cd51fcc5ba9).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `ChiSqSelector stands for Chi-Squared feature selection. It operates on the labeled data. ChiSqSelector orders categorical features based on their values of Chi-Squared test on independence from class and filters (selects) top given features.  `




[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75341961
  
    I think it's better to have an example, even if it doesn't really do anything useful on the toy datasets which ship with Spark.  We could add a hand-constructed dataset now or later on to improve the example.



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75621063
  
    LGTM  Thanks for the updates!



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75614737
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27857/



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75343738
  
      [Test build #27799 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27799/consoleFull) for   PR 4709 at commit [`eb6b9fe`](https://github.com/apache/spark/commit/eb6b9fe61126f3b75d4741bc2a978cd51fcc5ba9).
     * This patch merges cleanly.



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75346574
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27796/



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4709#discussion_r25188678
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -375,3 +375,55 @@ data2 = labels.zip(normalizer2.transform(features))
     {% endhighlight %}
     </div>
     </div>
    +
    +## Feature selection
    +[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) allows selecting the most relevant features for use in model construction. The number of features to select can be determined using the validation set. Feature selection is usually applied on sparse data, for example in text classification. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors. 
    +
    +### ChiSqSelector
    +ChiSqSelector stands for Chi-Squared feature selection. It operates on the labeled data. ChiSqSelector orders categorical features based on their values of Chi-Squared test on independence from class and filters (selects) top given features.  
    +
    +#### Model Fitting
    +
    +[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) has the
    +following parameters in the constructor:
    +
    +* `numTopFeatures` number of top features that selector will select (filter).
    +
    +We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method in
    +`ChiSqSelector` which can take an input of `RDD[LabeledPoint]` with categorical features, learn the summary statistics, and then
    +return a model which can transform the input dataset into the reduced feature space.
    +
    +This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
    +which can apply the Chi-Squared feature selection on a `Vector` to produce a reduced `Vector` or on
    +an `RDD[Vector]` to produce a reduced `RDD[Vector]`.
    +
    +Note that the model that performs actual feature filtering can be instantiated independently with array of feature indices that has to be sorted ascending.
    +
    +#### Example
    +
    +The following example shows the basic use of ChiSqSelector.
    +
    +<div class="codetabs">
    +<div data-lang="scala">
    +{% highlight scala %}
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.util.MLUtils
    +
    +// load some data in libsvm format, each point is in the range 0..255
    +val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    +// discretize data in 16 equal bins
    +val discretizedData = data.map { lp =>
    +  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 } ) )
    +}
    +// create ChiSqSelector that will select 50 features
    +val selector = new ChiSqSelector(50)
    +// create ChiSqSelector model
    +val transformer = selector.fit(disctetizedData)
    +// filter top 50 features
    +val filteredData = transformer.transform(discretizedData)
    --- End diff --
    
    Since transform() takes an RDD[Vector], you'll need to map the data to features, and then zip the transformed features with the labels.
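
A minimal sketch of the change suggested in this comment, assuming the `ChiSqSelector` API as described in the diff above (`fit` on an `RDD[LabeledPoint]`, `transform` on an `RDD[Vector]`). The object name, the `selectTopFeatures` helper, and the local `SparkContext` setup are hypothetical and only serve the illustration. The `math.floor` call is also an assumption, added so that the 16 bins are integer-valued; the diff itself only divides by 16.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

object ChiSqSelectorSketch {

  // Hypothetical helper (not part of the PR): fits a selector on labeled data,
  // transforms only the feature vectors, and zips them back with the labels.
  def selectTopFeatures(data: RDD[LabeledPoint], numTopFeatures: Int): RDD[LabeledPoint] = {
    val transformer = new ChiSqSelector(numTopFeatures).fit(data)
    val labels = data.map(_.label)
    val features = data.map(_.features)
    // Both RDDs are derived from `data` by element-wise maps, so they have the
    // same partitioning and the same number of elements per partition, as zip requires.
    labels.zip(transformer.transform(features)).map {
      case (label, selected) => LabeledPoint(label, selected)
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ChiSqSelectorSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Load data in libsvm format; each feature value is in the range 0..255.
      val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
      // Discretize into 16 bins (flooring is an assumption of this sketch).
      val discretizedData = data.map { lp =>
        LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map(x => math.floor(x / 16))))
      }.cache()
      // Keep the 50 top-ranked features, with labels preserved.
      val filteredData = selectTopFeatures(discretizedData, 50)
      println(filteredData.first().features.size)
    } finally {
      sc.stop()
    }
  }
}
```

An alternative that avoids `zip` would be to map over the labeled points and call the model's single-`Vector` `transform` on each `lp.features`; the zip form is used here because it mirrors the `labels.zip(normalizer2.transform(features))` pattern already present in the guide.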



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4709#discussion_r25113936
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -375,3 +375,28 @@ data2 = labels.zip(normalizer2.transform(features))
     {% endhighlight %}
     </div>
     </div>
    +
    +## Feature selection
    +Feature selection identifies the relevant features for use in model construction and leaves out the redundant ones. The number of features to select can be tuned on a validation set. Feature selection is usually applied to sparse data, for example in text classification. It reduces the size of the vector space and, in turn, the complexity of any subsequent operation on vectors.
    --- End diff --
    
    Would you mind adding a link to Wikipedia? [http://en.wikipedia.org/wiki/Feature_selection]



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75611280
  
    [Test build #27860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27860/consoleFull) for PR 4709 at commit [`19a8a4e`](https://github.com/apache/spark/commit/19a8a4e9b8c3b5607c87fb1eae19810f90b9ad6a).
     * This patch merges cleanly.



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4709#issuecomment-75348988
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27799/

