Posted to dev@spark.apache.org by "Ulanov, Alexander" <al...@hp.com> on 2014/07/10 19:38:33 UTC

Feature selection interface

Hi,

I've implemented a class that does chi-squared feature selection for RDD[LabeledPoint]. It also computes basic class/feature occurrence statistics, and other methods like mutual information or information gain can be easily implemented. I would like to make a pull request. However, the MLlib master branch doesn't have any feature selection methods implemented, so I need to create a proper interface that my class will extend or mix in. It should be easy to use from both the developer's and the user's perspective.

I was thinking that there should be a FeatureEvaluator that, for each feature of an RDD[LabeledPoint], returns RDD[((featureIndex: Int, label: Double), value: Double)].
Then there should be a FeatureSelector that selects the top N features, or the top N features grouped by class, etc.
And the simplest one, a FeatureFilter that filters the data based on a set of feature indices.

Additionally, there should be an interface for FeatureEvaluators that don't use class labels, i.e. for RDD[Vector].
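
To make the proposal concrete, here is a minimal sketch of how these pieces could look. It is only an illustration under stated assumptions: a local Seq stands in for RDD so the snippet runs without Spark, LabeledPoint is a simplified stand-in for the MLlib class, and every trait, class and method name other than FeatureEvaluator, FeatureSelector and FeatureFilter (for example UnlabeledFeatureEvaluator and TopNSelector) is hypothetical.

```scala
// Simplified stand-in for MLlib's LabeledPoint (dense features only).
case class LabeledPoint(label: Double, features: Array[Double])

// Scores every (featureIndex, label) pair; Seq stands in for RDD here.
trait FeatureEvaluator {
  def evaluate(data: Seq[LabeledPoint]): Seq[((Int, Double), Double)]
}

// Hypothetical variant for evaluators that do not use class labels.
trait UnlabeledFeatureEvaluator {
  def evaluate(data: Seq[Array[Double]]): Seq[(Int, Double)]
}

// Turns evaluator output into a set of feature indices to keep.
trait FeatureSelector {
  def select(scores: Seq[((Int, Double), Double)]): Set[Int]
}

// Keeps the n features with the highest score, taking the maximum over labels.
class TopNSelector(n: Int) extends FeatureSelector {
  def select(scores: Seq[((Int, Double), Double)]): Set[Int] =
    scores
      .groupBy { case ((index, _), _) => index }
      .map { case (index, group) => (index, group.map(_._2).max) }
      .toSeq
      .sortBy { case (_, score) => -score }
      .take(n)
      .map(_._1)
      .toSet
}

// Projects each point onto the selected indices, in ascending index order.
object FeatureFilter {
  def apply(data: Seq[LabeledPoint], indices: Set[Int]): Seq[LabeledPoint] = {
    val kept = indices.toArray.sorted
    data.map(p => LabeledPoint(p.label, kept.map(i => p.features(i))))
  }
}
```

Keeping the selector independent of the evaluator means any evaluator that emits the same score shape can reuse it, at the cost of the three-object split discussed in this message.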

I am concerned that such a design looks rather "disconnected" because it consists of 3 separate objects.

In use, I would like to see something like "val filteredData = Filter(data, ChiSquared(data).selectTop(100))".
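
A rough sketch of what could sit behind that one-liner is given here. It is not the code from my class: Seq replaces RDD so it runs locally, LabeledPoint is a simplified stand-in, and the sketch assumes a feature counts as "present" when its value is non-zero and computes a single Pearson chi-squared statistic per feature from the presence-by-label table. The statistic method is a hypothetical helper added for illustration.

```scala
// Simplified stand-in for MLlib's LabeledPoint (dense features only).
case class LabeledPoint(label: Double, features: Array[Double])

// Assumption: a feature is "present" in a point when its value is non-zero.
class ChiSquared(data: Seq[LabeledPoint]) {
  private val n = data.size.toDouble
  private val labelCounts: Map[Double, Double] =
    data.groupBy(_.label).map { case (l, ps) => (l, ps.size.toDouble) }
  private val numFeatures = data.headOption.map(_.features.length).getOrElse(0)

  // Contribution of one contingency-table cell; empty expected cells add 0.
  private def cell(observed: Double, expected: Double): Double =
    if (expected == 0.0) 0.0 else math.pow(observed - expected, 2) / expected

  // Pearson chi-squared statistic of the presence-by-label table of a feature.
  def statistic(index: Int): Double = {
    val present = data.filter(_.features(index) != 0.0)
    val presentTotal = present.size.toDouble
    val presentByLabel =
      present.groupBy(_.label).map { case (l, ps) => (l, ps.size.toDouble) }
    labelCounts.toSeq.map { case (label, labelTotal) =>
      val observedPresent = presentByLabel.getOrElse(label, 0.0)
      val observedAbsent = labelTotal - observedPresent
      val expectedPresent = presentTotal * labelTotal / n
      val expectedAbsent = (n - presentTotal) * labelTotal / n
      cell(observedPresent, expectedPresent) + cell(observedAbsent, expectedAbsent)
    }.sum
  }

  // Indices of the k features with the largest statistic.
  def selectTop(k: Int): Set[Int] =
    (0 until numFeatures).sortBy(i => -statistic(i)).take(k).toSet
}

object ChiSquared {
  def apply(data: Seq[LabeledPoint]): ChiSquared = new ChiSquared(data)
}

// Keeps only the given feature indices, in ascending index order.
object Filter {
  def apply(data: Seq[LabeledPoint], indices: Set[Int]): Seq[LabeledPoint] = {
    val kept = indices.toArray.sorted
    data.map(p => LabeledPoint(p.label, kept.map(i => p.features(i))))
  }
}
```

With these definitions the call reads exactly as in the line above, e.g. Filter(data, ChiSquared(data).selectTop(100)), and the evaluator and selector are folded into a single object.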

Any ideas or suggestions?

Best regards, Alexander

RE: Feature selection interface

Posted by "Ulanov, Alexander" <al...@hp.com>.
FYI, this is my first take on feature selection, filtering and chi-squared:
https://github.com/apache/spark/pull/1484