Posted to commits@spark.apache.org by jk...@apache.org on 2015/02/23 21:10:31 UTC

spark git commit: [MLLIB] SPARK-5912 Programming guide for feature selection

Repository: spark
Updated Branches:
  refs/heads/master 651a1c019 -> 28ccf5ee7


[MLLIB] SPARK-5912 Programming guide for feature selection

Added a description of ChiSqSelector and a few words about feature selection in general. I could add a code example; however, it would not look reasonable in the absence of a feature discretizer or a dataset in the `data` folder that has redundant features.

Author: Alexander Ulanov <na...@yandex.ru>

Closes #4709 from avulanov/SPARK-5912 and squashes the following commits:

19a8a4e [Alexander Ulanov] Addressing reviewers comments @jkbradley
58d9e4d [Alexander Ulanov] Addressing reviewers comments @jkbradley
eb6b9fe [Alexander Ulanov] Typo
2921a1d [Alexander Ulanov] ChiSqSelector example of use
c845350 [Alexander Ulanov] ChiSqSelector docs


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/28ccf5ee
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/28ccf5ee
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/28ccf5ee

Branch: refs/heads/master
Commit: 28ccf5ee769a1df019e38985112065c01724fbd9
Parents: 651a1c0
Author: Alexander Ulanov <na...@yandex.ru>
Authored: Mon Feb 23 12:09:40 2015 -0800
Committer: Joseph K. Bradley <jo...@databricks.com>
Committed: Mon Feb 23 12:09:40 2015 -0800

----------------------------------------------------------------------
 docs/mllib-feature-extraction.md | 54 +++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/28ccf5ee/docs/mllib-feature-extraction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index d4a61a7..d588b9c 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -375,3 +375,57 @@ data2 = labels.zip(normalizer2.transform(features))
 {% endhighlight %}
 </div>
 </div>
+
+## Feature selection
+[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) identifies the most relevant features for use in model construction. The number of features to select can be tuned using a held-out validation set. Feature selection is usually applied to sparse data, for example in text classification. It reduces the size of the vector space and, in turn, the complexity of any subsequent operation on the vectors.
+
+### ChiSqSelector
+`ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with categorical features. `ChiSqSelector` orders features based on a Chi-Squared test of independence from the class, and then filters (selects) the top `numTopFeatures` features, i.e. those on which the class label depends most.
+
+#### Model Fitting
+
+[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) has the
+following parameters in the constructor:
+
+* `numTopFeatures`: the number of top features that the selector will select (filter).
+
+We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method in
+`ChiSqSelector` which can take an input of `RDD[LabeledPoint]` with categorical features, learn the summary statistics, and then
+return a model which can transform the input dataset into the reduced feature space.
+
+This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
+which can apply the Chi-Squared feature selection on a `Vector` to produce a reduced `Vector` or on
+an `RDD[Vector]` to produce a reduced `RDD[Vector]`.
+
+Note that the model that performs the actual feature filtering can also be instantiated directly from an array of selected feature indices, which must be sorted in ascending order.
+
+#### Example
+
+The following example shows the basic use of ChiSqSelector.
+
+<div class="codetabs">
+<div data-lang="scala">
+{% highlight scala %}
+import org.apache.spark.mllib.feature.ChiSqSelector
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.util.MLUtils
+
+// load some data in libsvm format, each feature value is in the range 0..255
+val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+// discretize data into 16 equal bins, since ChiSqSelector expects categorical features
+val discretizedData = data.map { lp =>
+  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => (x / 16).floor }))
+}
+// create a ChiSqSelector that will select the top 50 features
+val selector = new ChiSqSelector(50)
+// fit the selector to obtain the model that performs the filtering
+val transformer = selector.fit(discretizedData)
+// keep only the 50 selected features in each feature vector
+val filteredData = discretizedData.map { lp =>
+  LabeledPoint(lp.label, transformer.transform(lp.features))
+}
+{% endhighlight %}
+</div>
+</div>
+
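The ranking criterion described in the ChiSqSelector section of the guide (a Chi-Squared test of independence between a categorical feature and the class) can be illustrated without Spark. The following is a minimal sketch in plain Scala; the object `ChiSqSketch`, the method `chiSquared`, and the two contingency tables are hypothetical names and data made up for illustration, and MLlib's own implementation may differ in its details.

```scala
// Illustrative only: ChiSqSketch and chiSquared are hypothetical names, not MLlib API.
object ChiSqSketch {
  /**
   * Pearson chi-squared statistic for one categorical feature against the label.
   * observed(i)(j) = number of examples with feature value i and class label j.
   */
  def chiSquared(observed: Array[Array[Double]]): Double = {
    val rowSums = observed.map(_.sum)
    val colSums = observed.transpose.map(_.sum)
    val total = rowSums.sum
    var stat = 0.0
    for (i <- observed.indices; j <- observed(i).indices) {
      // count expected in cell (i, j) if feature and label were independent
      val expected = rowSums(i) * colSums(j) / total
      if (expected > 0) {
        val diff = observed(i)(j) - expected
        stat += diff * diff / expected
      }
    }
    stat
  }

  def main(args: Array[String]): Unit = {
    // Made-up data: feature A is perfectly aligned with the label,
    // feature B is independent of it.
    val featureA = Array(Array(10.0, 0.0), Array(0.0, 10.0))
    val featureB = Array(Array(5.0, 5.0), Array(5.0, 5.0))
    println(chiSquared(featureA)) // prints 20.0
    println(chiSquared(featureB)) // prints 0.0
  }
}
```

A selector that keeps one feature would keep A here, because its statistic is larger, meaning that the label depends on it more strongly.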

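The guide's note that the filtering model can be instantiated independently from a sorted array of feature indices might look like the sketch below. It assumes the model class is `org.apache.spark.mllib.feature.ChiSqSelectorModel` with a public constructor taking the index array, and that spark-mllib is on the classpath (for example, in `spark-shell`); the indices and the input vector are made up for illustration.

```scala
import org.apache.spark.mllib.feature.ChiSqSelectorModel
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical choice of feature indices; they must be sorted in ascending order.
val selectedIndices = Array(0, 2, 3)
// Build the filtering model directly, without fitting a ChiSqSelector first.
val model = new ChiSqSelectorModel(selectedIndices)
// The reduced vector keeps only the values at positions 0, 2 and 3 of the input.
val reduced = model.transform(Vectors.dense(1.0, 2.0, 3.0, 4.0, 5.0))
```

This can be handy when the selected indices were computed earlier and stored, so the same filtering can be reapplied without refitting on the labeled data.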
