Posted to commits@spark.apache.org by we...@apache.org on 2021/01/16 03:12:08 UTC
[spark] branch branch-3.1 updated: [SPARK-34080][ML][PYTHON] Add UnivariateFeatureSelector
This is an automated email from the ASF dual-hosted git repository.
weichenxu123 pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.1 by this push:
new cb8fb0e [SPARK-34080][ML][PYTHON] Add UnivariateFeatureSelector
cb8fb0e is described below
commit cb8fb0e3c43743dacc7a5e06d028ff60b49d9a5b
Author: Huaxin Gao <hu...@us.ibm.com>
AuthorDate: Sat Jan 16 11:09:23 2021 +0800
[SPARK-34080][ML][PYTHON] Add UnivariateFeatureSelector
### What changes were proposed in this pull request?
Add UnivariateFeatureSelector
### Why are the changes needed?
Provide a single UnivariateFeatureSelector so that three separate feature selectors are no longer needed.
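As a minimal sketch of why one selector can stand in for several: the documentation updated in this commit maps each supported (`featureType`, `labelType`) pair to a score function. The helper below is Spark-free and purely illustrative; `SCORE_FUNCTIONS` and `pick_score_function` are hypothetical names, not Spark API, and the code only mirrors that documented table.

```python
# Illustration only: mirrors the featureType/labelType table in docs/ml-features.md.
# These names are hypothetical and do not exist in Spark.
SCORE_FUNCTIONS = {
    ("categorical", "categorical"): "chi2",
    ("continuous", "categorical"): "f_classif",
    ("continuous", "continuous"): "f_regression",
}


def pick_score_function(feature_type: str, label_type: str) -> str:
    """Return the score function name documented for a feature/label type pair."""
    try:
        return SCORE_FUNCTIONS[(feature_type, label_type)]
    except KeyError:
        # Pairs that are absent from the documented table are rejected here.
        raise ValueError(
            f"unsupported combination: featureType={feature_type!r}, "
            f"labelType={label_type!r}"
        ) from None


if __name__ == "__main__":
    print(pick_score_function("continuous", "categorical"))
```

Note that the table lists only three pairs; categorical features with a continuous label are not among them, so this sketch raises an error for that combination.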
### Does this PR introduce _any_ user-facing change?
Yes
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous", selectorType="numTopFeatures", numTopFeatures=100)
```
Or, naming the score function directly:
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], scoreFunction="f_classif", selectorType="numTopFeatures", numTopFeatures=100)
```
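The five selection modes that the updated documentation describes (`numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`) can be sketched without Spark as operations on a list of per-feature p-values. This is a rough illustration under stated assumptions, not the Spark implementation: `select_features` is a hypothetical helper, ties are broken by feature index, and the `percentile` count is assumed to be truncated to an integer.

```python
from typing import List


def select_features(p_values: List[float], mode: str, threshold: float) -> List[int]:
    """Return the sorted indices of the selected features (hypothetical helper)."""
    n = len(p_values)
    # Rank features from most to least significant; ties broken by index (assumption).
    ranked = sorted(range(n), key=lambda i: (p_values[i], i))

    if mode == "numTopFeatures":
        # Fixed number of top features.
        chosen = ranked[: int(threshold)]
    elif mode == "percentile":
        # Fraction of all features; the count is truncated (assumption).
        chosen = ranked[: int(n * threshold)]
    elif mode == "fpr":
        # All features whose p-value is below the threshold.
        chosen = [i for i in ranked if p_values[i] < threshold]
    elif mode == "fdr":
        # Benjamini-Hochberg: keep the first k ranked features, where k is the
        # largest rank whose p-value satisfies p_(k) <= threshold * k / n.
        k = 0
        for rank, i in enumerate(ranked, start=1):
            if p_values[i] <= threshold * rank / n:
                k = rank
        chosen = ranked[:k]
    elif mode == "fwe":
        # Threshold scaled by 1/numFeatures.
        chosen = [i for i in ranked if p_values[i] < threshold / n]
    else:
        raise ValueError(f"unknown selection mode: {mode!r}")

    return sorted(chosen)


if __name__ == "__main__":
    p = [0.2, 0.01, 0.5, 0.03, 0.001, 0.04]
    for mode, thr in [("numTopFeatures", 1), ("percentile", 0.5),
                      ("fpr", 0.05), ("fdr", 0.05), ("fwe", 0.05)]:
        print(mode, select_features(p, mode, thr))
```

With the sample p-values above, `fwe` is the strictest of the three p-value-based modes and `fpr` the most permissive, which matches the ordering one would expect from the descriptions in the documentation.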
### How was this patch tested?
Added unit tests.
Closes #31160 from huaxingao/UnivariateSelector.
Authored-by: Huaxin Gao <hu...@us.ibm.com>
Signed-off-by: Weichen Xu <we...@databricks.com>
(cherry picked from commit f3548837c643b2da03ce6b20b5b103e4392e52dc)
Signed-off-by: Weichen Xu <we...@databricks.com>
---
docs/ml-features.md | 111 +---
docs/ml-statistics.md | 54 +-
.../spark/examples/ml/JavaANOVATestExample.java | 75 ---
.../examples/ml/JavaFValueSelectorExample.java | 81 ---
.../spark/examples/ml/JavaFValueTestExample.java | 75 ---
...a => JavaUnivariateFeatureSelectorExample.java} | 21 +-
examples/src/main/python/ml/anova_test_example.py | 50 --
.../src/main/python/ml/fvalue_selector_example.py | 53 --
examples/src/main/python/ml/fvalue_test_example.py | 50 --
...e.py => univariate_feature_selector_example.py} | 16 +-
.../spark/examples/ml/ANOVATestExample.scala | 63 --
.../spark/examples/ml/FValueSelectorExample.scala | 69 ---
.../spark/examples/ml/FValueTestExample.scala | 63 --
...cala => UnivariateFeatureSelectorExample.scala} | 20 +-
.../apache/spark/ml/feature/ANOVASelector.scala | 195 ------
.../apache/spark/ml/feature/ChiSqSelector.scala | 1 +
.../apache/spark/ml/feature/FValueSelector.scala | 195 ------
.../org/apache/spark/ml/feature/Selector.scala | 12 +-
.../ml/feature/UnivariateFeatureSelector.scala | 467 ++++++++++++++
.../scala/org/apache/spark/ml/stat/ANOVATest.scala | 2 +-
.../org/apache/spark/ml/stat/FValueTest.scala | 2 +-
.../spark/ml/feature/ANOVASelectorSuite.scala | 206 -------
.../spark/ml/feature/FValueSelectorSuite.scala | 238 -------
.../feature/UnivariateFeatureSelectorSuite.scala | 685 +++++++++++++++++++++
python/docs/source/reference/pyspark.ml.rst | 8 +-
python/pyspark/ml/feature.py | 449 +++++++-------
python/pyspark/ml/feature.pyi | 116 ++--
python/pyspark/ml/stat.py | 148 -----
python/pyspark/ml/stat.pyi | 12 -
29 files changed, 1512 insertions(+), 2025 deletions(-)
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 660c272..dc87713 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1793,19 +1793,28 @@ for more details on the API.
</div>
</div>
-## ANOVASelector
+## UnivariateFeatureSelector
-`ANOVASelector` operates on categorical labels with continuous features. It uses the
-[one-way ANOVA F-test](https://en.wikipedia.org/wiki/F-test#Multiple-comparison_ANOVA_problems) to decide which
-features to choose.
-It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:
-* `numTopFeatures` chooses a fixed number of top features according to ANOVA F-test.
+`UnivariateFeatureSelector` operates on categorical/continuous labels with categorical/continuous features.
+User can set `featureType` and `labelType`, and Spark will pick the score function to use based on the specified
+`featureType` and `labelType`.
+
+~~~
+featureType | labelType |score function
+------------|------------|--------------
+categorical |categorical | chi2
+continuous |categorical | f_classif
+continuous |continuous | f_regression
+~~~
+
+It supports five selection modes: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:
+* `numTopFeatures` chooses a fixed number of top features.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
-By default, the selection method is `numTopFeatures`, with the default number of top features set to 50.
-The user can choose a selection method using `setSelectorType`.
+
+By default, the selection mode is `numTopFeatures`, with the default selectionThreshold sets to 50.
**Examples**
@@ -1823,7 +1832,7 @@ id | features | label
6 | [7.9, 8.5, 9.2, 4.0, 9.4, 2.1] | 4.0
~~~
-If we use `ANOVASelector` with `numTopFeatures = 1`, the
+If we set `featureType` to `continuous` and `labelType` to `categorical` with `numTopFeatures = 1`, the
last column in our `features` is chosen as the most useful feature:
~~~
@@ -1840,96 +1849,26 @@ id | features | label | selectedFeatures
<div class="codetabs">
<div data-lang="scala" markdown="1">
-Refer to the [ANOVASelector Scala docs](api/scala/org/apache/spark/ml/feature/ANOVASelector.html)
-for more details on the API.
-
-{% include_example scala/org/apache/spark/examples/ml/ANOVASelectorExample.scala %}
-</div>
-
-<div data-lang="java" markdown="1">
-
-Refer to the [ANOVASelector Java docs](api/java/org/apache/spark/ml/feature/ANOVASelector.html)
-for more details on the API.
-
-{% include_example java/org/apache/spark/examples/ml/JavaANOVASelectorExample.java %}
-</div>
-
-<div data-lang="python" markdown="1">
-
-Refer to the [ANOVASelector Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.ANOVASelector)
-for more details on the API.
-
-{% include_example python/ml/anova_selector_example.py %}
-</div>
-</div>
-
-## FValueSelector
-
-`FValueSelector` operates on categorical labels with continuous features. It uses the
-[F-test for regression](https://en.wikipedia.org/wiki/F-test#Regression_problems) to decide which
-features to choose.
-It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:
-* `numTopFeatures` chooses a fixed number of top features according to a F-test for regression.
-* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
-* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
-* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
-* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
-By default, the selection method is `numTopFeatures`, with the default number of top features set to 50.
-The user can choose a selection method using `setSelectorType`.
-
-**Examples**
-
-Assume that we have a DataFrame with the columns `id`, `features`, and `label`, which is used as
-our target to be predicted:
-
-~~~
-id | features | label
----|--------------------------------|---------
- 1 | [6.0, 7.0, 0.0, 7.0, 6.0, 0.0] | 4.6
- 2 | [0.0, 9.0, 6.0, 0.0, 5.0, 9.0] | 6.6
- 3 | [0.0, 9.0, 3.0, 0.0, 5.0, 5.0] | 5.1
- 4 | [0.0, 9.0, 8.0, 5.0, 6.0, 4.0] | 7.6
- 5 | [8.0, 9.0, 6.0, 5.0, 4.0, 4.0] | 9.0
- 6 | [8.0, 9.0, 6.0, 4.0, 0.0, 0.0] | 9.0
-~~~
-
-If we use `FValueSelector` with `numTopFeatures = 1`, the
-3rd column in our `features` is chosen as the most useful feature:
-
-~~~
-id | features | label | selectedFeatures
----|--------------------------------|---------|------------------
- 1 | [6.0, 7.0, 0.0, 7.0, 6.0, 0.0] | 4.6 | [0.0]
- 2 | [0.0, 9.0, 6.0, 0.0, 5.0, 9.0] | 6.6 | [6.0]
- 3 | [0.0, 9.0, 3.0, 0.0, 5.0, 5.0] | 5.1 | [3.0]
- 4 | [0.0, 9.0, 8.0, 5.0, 6.0, 4.0] | 7.6 | [8.0]
- 5 | [8.0, 9.0, 6.0, 5.0, 4.0, 4.0] | 9.0 | [6.0]
- 6 | [8.0, 9.0, 6.0, 4.0, 0.0, 0.0] | 9.0 | [6.0]
-~~~
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-
-Refer to the [FValueSelector Scala docs](api/scala/org/apache/spark/ml/feature/FValueSelector.html)
+Refer to the [UnivariateFeatureSelector Scala docs](api/scala/org/apache/spark/ml/feature/UnivariateFeatureSelector.html)
for more details on the API.
-{% include_example scala/org/apache/spark/examples/ml/FValueSelectorExample.scala %}
+{% include_example scala/org/apache/spark/examples/ml/UnivariateFeatureSelectorExample.scala %}
</div>
<div data-lang="java" markdown="1">
-Refer to the [FValueSelector Java docs](api/java/org/apache/spark/ml/feature/FValueSelector.html)
+Refer to the [UnivariateFeatureSelector Java docs](api/java/org/apache/spark/ml/feature/UnivariateFeatureSelector.html)
for more details on the API.
-{% include_example java/org/apache/spark/examples/ml/JavaFValueSelectorExample.java %}
+{% include_example java/org/apache/spark/examples/ml/JavaUnivariateFeatureSelectorExample.java %}
</div>
<div data-lang="python" markdown="1">
-Refer to the [FValueSelector Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.FValueSelector)
+Refer to the [UnivariateFeatureSelector Python docs](api/python/reference/api/pyspark.ml.feature.UnivariateFeatureSelector.html)
for more details on the API.
-{% include_example python/ml/anova_selector_example.py %}
+{% include_example python/ml/univariate_feature_selector_example.py %}
</div>
</div>
@@ -1974,7 +1913,7 @@ id | features | selectedFeatures
<div class="codetabs">
<div data-lang="scala" markdown="1">
-Refer to the [VarianceThresholdSelector Scala docs]((api/python/pyspark.ml.html#pyspark.ml.feature.ChiSqSelector))
+Refer to the [VarianceThresholdSelector Scala docs]((api/python/pyspark.ml.html#pyspark.ml.feature.VarianceThresholdSelector))
for more details on the API.
{% include_example scala/org/apache/spark/examples/ml/VarianceThresholdSelectorExample.scala %}
diff --git a/docs/ml-statistics.md b/docs/ml-statistics.md
index 637cdd6..334a42e 100644
--- a/docs/ml-statistics.md
+++ b/docs/ml-statistics.md
@@ -79,33 +79,7 @@ The output will be a DataFrame that contains the correlation matrix of the colum
Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically
significant, whether this result occurred by chance or not. `spark.ml` currently supports Pearson's
-Chi-squared ( $\chi^2$) tests for independence, as well as ANOVA test for classification tasks and
-F-value test for regression tasks.
-
-### ANOVATest
-
-`ANOVATest` computes ANOVA F-values between labels and features for classification tasks. The labels should be categorical
-and features should be continuous.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-Refer to the [`ANOVATest` Scala docs](api/scala/org/apache/spark/ml/stat/ANOVATest$.html) for details on the API.
-
-{% include_example scala/org/apache/spark/examples/ml/ANOVATestExample.scala %}
-</div>
-
-<div data-lang="java" markdown="1">
-Refer to the [`ANOVATest` Java docs](api/java/org/apache/spark/ml/stat/ANOVATest.html) for details on the API.
-
-{% include_example java/org/apache/spark/examples/ml/JavaANOVATestExample.java %}
-</div>
-
-<div data-lang="python" markdown="1">
-Refer to the [`ANOVATest` Python docs](api/python/index.html#pyspark.ml.stat.ANOVATest$) for details on the API.
-
-{% include_example python/ml/anova_test_example.py %}
-</div>
-</div>
+Chi-squared ( $\chi^2$) tests for independence.
### ChiSquareTest
@@ -134,32 +108,6 @@ Refer to the [`ChiSquareTest` Python docs](api/python/index.html#pyspark.ml.stat
</div>
-### FValueTest
-
-`FValueTest` computes F-values between labels and features for regression tasks. Both the labels
- and features should be continuous.
-
- <div class="codetabs">
- <div data-lang="scala" markdown="1">
- Refer to the [`FValueTest` Scala docs](api/scala/org/apache/spark/ml/stat/FValueTest$.html) for details on the API.
-
- {% include_example scala/org/apache/spark/examples/ml/FValueTestExample.scala %}
- </div>
-
- <div data-lang="java" markdown="1">
- Refer to the [`FValueTest` Java docs](api/java/org/apache/spark/ml/stat/FValueTest.html) for details on the API.
-
- {% include_example java/org/apache/spark/examples/ml/JavaFValueTestExample.java %}
- </div>
-
- <div data-lang="python" markdown="1">
- Refer to the [`FValueTest` Python docs](api/python/index.html#pyspark.ml.stat.FValueTest$) for details on the API.
-
- {% include_example python/ml/fvalue_test_example.py %}
- </div>
-
- </div>
-
## Summarizer
We provide vector column summary statistics for `Dataframe` through `Summarizer`.
diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java
deleted file mode 100644
index 4785dbd..0000000
--- a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java
+++ /dev/null
@@ -1,75 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.examples.ml;
-
-import org.apache.spark.sql.SparkSession;
-
-// $example on$
-import java.util.Arrays;
-import java.util.List;
-
-import org.apache.spark.ml.linalg.Vectors;
-import org.apache.spark.ml.linalg.VectorUDT;
-import org.apache.spark.ml.stat.ANOVATest;
-import org.apache.spark.sql.Dataset;
-import org.apache.spark.sql.Row;
-import org.apache.spark.sql.RowFactory;
-import org.apache.spark.sql.types.*;
-// $example off$
-
-/**
- * An example for ANOVA testing.
- * Run with
- * <pre>
- * bin/run-example ml.JavaANOVATestExample
- * </pre>
- */
-public class JavaANOVATestExample {
-
- public static void main(String[] args) {
- SparkSession spark = SparkSession
- .builder()
- .appName("JavaANOVATestExample")
- .getOrCreate();
-
- // $example on$
- List<Row> data = Arrays.asList(
- RowFactory.create(3.0, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3)),
- RowFactory.create(2.0, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1)),
- RowFactory.create(3.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)),
- RowFactory.create(2.0, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8)),
- RowFactory.create(4.0, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0)),
- RowFactory.create(4.0, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1))
- );
-
- StructType schema = new StructType(new StructField[]{
- new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
- new StructField("features", new VectorUDT(), false, Metadata.empty()),
- });
-
- Dataset<Row> df = spark.createDataFrame(data, schema);
- Row r = ANOVATest.test(df, "features", "label").head();
- System.out.println("pValues: " + r.get(0).toString());
- System.out.println("degreesOfFreedom: " + r.getList(1).toString());
- System.out.println("fValues: " + r.get(2).toString());
-
- // $example off$
-
- spark.stop();
- }
-}
diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueSelectorExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueSelectorExample.java
deleted file mode 100644
index e8253ff..0000000
--- a/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueSelectorExample.java
+++ /dev/null
@@ -1,81 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.examples.ml;
-
-import org.apache.spark.sql.Dataset;
-import org.apache.spark.sql.SparkSession;
-
-// $example on$
-import java.util.Arrays;
-import java.util.List;
-
-import org.apache.spark.ml.feature.FValueSelector;
-import org.apache.spark.ml.linalg.VectorUDT;
-import org.apache.spark.ml.linalg.Vectors;
-import org.apache.spark.sql.Row;
-import org.apache.spark.sql.RowFactory;
-import org.apache.spark.sql.types.*;
-// $example off$
-
-/**
- * An example demonstrating FValueSelector.
- * Run with
- * <pre>
- * bin/run-example ml.JavaFValueSelectorExample
- * </pre>
- */
-public class JavaFValueSelectorExample {
- public static void main(String[] args) {
- SparkSession spark = SparkSession
- .builder()
- .appName("JavaFValueSelectorExample")
- .getOrCreate();
-
- // $example on$
- List<Row> data = Arrays.asList(
- RowFactory.create(1, Vectors.dense(6.0, 7.0, 0.0, 7.0, 6.0, 0.0), 4.6),
- RowFactory.create(2, Vectors.dense(0.0, 9.0, 6.0, 0.0, 5.0, 9.0), 6.6),
- RowFactory.create(3, Vectors.dense(0.0, 9.0, 3.0, 0.0, 5.0, 5.0), 5.1),
- RowFactory.create(4, Vectors.dense(0.0, 9.0, 8.0, 5.0, 6.0, 4.0), 7.6),
- RowFactory.create(5, Vectors.dense(8.0, 9.0, 6.0, 5.0, 4.0, 4.0), 9.0),
- RowFactory.create(6, Vectors.dense(8.0, 9.0, 6.0, 4.0, 0.0, 0.0), 9.0)
- );
- StructType schema = new StructType(new StructField[]{
- new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
- new StructField("features", new VectorUDT(), false, Metadata.empty()),
- new StructField("label", DataTypes.DoubleType, false, Metadata.empty())
- });
-
- Dataset<Row> df = spark.createDataFrame(data, schema);
-
- FValueSelector selector = new FValueSelector()
- .setNumTopFeatures(1)
- .setFeaturesCol("features")
- .setLabelCol("label")
- .setOutputCol("selectedFeatures");
-
- Dataset<Row> result = selector.fit(df).transform(df);
-
- System.out.println("FValueSelector output with top " + selector.getNumTopFeatures()
- + " features selected");
- result.show();
-
- // $example off$
- spark.stop();
- }
-}
diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java
deleted file mode 100644
index cda28db..0000000
--- a/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java
+++ /dev/null
@@ -1,75 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.examples.ml;
-
-import org.apache.spark.sql.SparkSession;
-
-// $example on$
-import java.util.Arrays;
-import java.util.List;
-
-import org.apache.spark.ml.linalg.Vectors;
-import org.apache.spark.ml.linalg.VectorUDT;
-import org.apache.spark.ml.stat.FValueTest;
-import org.apache.spark.sql.Dataset;
-import org.apache.spark.sql.Row;
-import org.apache.spark.sql.RowFactory;
-import org.apache.spark.sql.types.*;
-// $example off$
-
-/**
- * An example for FValue testing.
- * Run with
- * <pre>
- * bin/run-example ml.JavaFValueTestExample
- * </pre>
- */
-public class JavaFValueTestExample {
-
- public static void main(String[] args) {
- SparkSession spark = SparkSession
- .builder()
- .appName("JavaFValueTestExample")
- .getOrCreate();
-
- // $example on$
- List<Row> data = Arrays.asList(
- RowFactory.create(4.6, Vectors.dense(6.0, 7.0, 0.0, 7.0, 6.0, 0.0)),
- RowFactory.create(6.6, Vectors.dense(0.0, 9.0, 6.0, 0.0, 5.0, 9.0)),
- RowFactory.create(5.1, Vectors.dense(0.0, 9.0, 3.0, 0.0, 5.0, 5.0)),
- RowFactory.create(7.6, Vectors.dense(0.0, 9.0, 8.0, 5.0, 6.0, 4.0)),
- RowFactory.create(9.0, Vectors.dense(8.0, 9.0, 6.0, 5.0, 4.0, 4.0)),
- RowFactory.create(9.0, Vectors.dense(8.0, 9.0, 6.0, 4.0, 0.0, 0.0))
- );
-
- StructType schema = new StructType(new StructField[]{
- new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
- new StructField("features", new VectorUDT(), false, Metadata.empty()),
- });
-
- Dataset<Row> df = spark.createDataFrame(data, schema);
- Row r = FValueTest.test(df, "features", "label").head();
- System.out.println("pValues: " + r.get(0).toString());
- System.out.println("degreesOfFreedom: " + r.getList(1).toString());
- System.out.println("fvalues: " + r.get(2).toString());
-
- // $example off$
-
- spark.stop();
- }
-}
diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVASelectorExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaUnivariateFeatureSelectorExample.java
similarity index 79%
rename from examples/src/main/java/org/apache/spark/examples/ml/JavaANOVASelectorExample.java
rename to examples/src/main/java/org/apache/spark/examples/ml/JavaUnivariateFeatureSelectorExample.java
index 6f24b45..748262f 100644
--- a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVASelectorExample.java
+++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaUnivariateFeatureSelectorExample.java
@@ -24,7 +24,7 @@ import org.apache.spark.sql.SparkSession;
import java.util.Arrays;
import java.util.List;
-import org.apache.spark.ml.feature.ANOVASelector;
+import org.apache.spark.ml.feature.UnivariateFeatureSelector;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Row;
@@ -33,17 +33,17 @@ import org.apache.spark.sql.types.*;
// $example off$
/**
- * An example for ANOVASelector.
+ * An example for UnivariateFeatureSelector.
* Run with
* <pre>
- * bin/run-example ml.JavaANOVASelectorExample
+ * bin/run-example ml.JavaUnivariateFeatureSelectorExample
* </pre>
*/
-public class JavaANOVASelectorExample {
+public class JavaUnivariateFeatureSelectorExample {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
- .appName("JavaANOVASelectorExample")
+ .appName("JavaUnivariateFeatureSelectorExample")
.getOrCreate();
// $example on$
@@ -63,16 +63,19 @@ public class JavaANOVASelectorExample {
Dataset<Row> df = spark.createDataFrame(data, schema);
- ANOVASelector selector = new ANOVASelector()
- .setNumTopFeatures(1)
+ UnivariateFeatureSelector selector = new UnivariateFeatureSelector()
+ .setFeatureType("continuous")
+ .setLabelType("categorical")
+ .setSelectionMode("numTopFeatures")
+ .setSelectionThreshold(1)
.setFeaturesCol("features")
.setLabelCol("label")
.setOutputCol("selectedFeatures");
Dataset<Row> result = selector.fit(df).transform(df);
- System.out.println("ANOVASelector output with top " + selector.getNumTopFeatures()
- + " features selected");
+ System.out.println("UnivariateFeatureSelector output with top "
+ + selector.getSelectionThreshold() + " features selected using f_classif");
result.show();
// $example off$
diff --git a/examples/src/main/python/ml/anova_test_example.py b/examples/src/main/python/ml/anova_test_example.py
deleted file mode 100644
index 451e078..0000000
--- a/examples/src/main/python/ml/anova_test_example.py
+++ /dev/null
@@ -1,50 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements. See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License. You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-An example for ANOVA testing.
-Run with:
- bin/spark-submit examples/src/main/python/ml/anova_test_example.py
-"""
-from pyspark.sql import SparkSession
-# $example on$
-from pyspark.ml.linalg import Vectors
-from pyspark.ml.stat import ANOVATest
-# $example off$
-
-if __name__ == "__main__":
- spark = SparkSession\
- .builder\
- .appName("ANOVATestExample")\
- .getOrCreate()
-
- # $example on$
- data = [(3.0, Vectors.dense([1.7, 4.4, 7.6, 5.8, 9.6, 2.3])),
- (2.0, Vectors.dense([8.8, 7.3, 5.7, 7.3, 2.2, 4.1])),
- (3.0, Vectors.dense([1.2, 9.5, 2.5, 3.1, 8.7, 2.5])),
- (2.0, Vectors.dense([3.7, 9.2, 6.1, 4.1, 7.5, 3.8])),
- (4.0, Vectors.dense([8.9, 5.2, 7.8, 8.3, 5.2, 3.0])),
- (4.0, Vectors.dense([7.9, 8.5, 9.2, 4.0, 9.4, 2.1]))]
- df = spark.createDataFrame(data, ["label", "features"])
-
- r = ANOVATest.test(df, "features", "label").head()
- print("pValues: " + str(r.pValues))
- print("degreesOfFreedom: " + str(r.degreesOfFreedom))
- print("fValues: " + str(r.fValues))
- # $example off$
-
- spark.stop()
diff --git a/examples/src/main/python/ml/fvalue_selector_example.py b/examples/src/main/python/ml/fvalue_selector_example.py
deleted file mode 100644
index f164af4..0000000
--- a/examples/src/main/python/ml/fvalue_selector_example.py
+++ /dev/null
@@ -1,53 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements. See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License. You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-An example for FValueSelector.
-Run with:
- bin/spark-submit examples/src/main/python/ml/fvalue_selector_example.py
-"""
-from pyspark.sql import SparkSession
-# $example on$
-from pyspark.ml.feature import FValueSelector
-from pyspark.ml.linalg import Vectors
-# $example off$
-
-if __name__ == "__main__":
- spark = SparkSession\
- .builder\
- .appName("FValueSelectorExample")\
- .getOrCreate()
-
- # $example on$
- df = spark.createDataFrame([
- (1, Vectors.dense([6.0, 7.0, 0.0, 7.0, 6.0, 0.0]), 4.6,),
- (2, Vectors.dense([0.0, 9.0, 6.0, 0.0, 5.0, 9.0]), 6.6,),
- (3, Vectors.dense([0.0, 9.0, 3.0, 0.0, 5.0, 5.0]), 5.1,),
- (4, Vectors.dense([0.0, 9.0, 8.0, 5.0, 6.0, 4.0]), 7.6,),
- (5, Vectors.dense([8.0, 9.0, 6.0, 5.0, 4.0, 4.0]), 9.0,),
- (6, Vectors.dense([8.0, 9.0, 6.0, 4.0, 0.0, 0.0]), 9.0,)], ["id", "features", "label"])
-
- selector = FValueSelector(numTopFeatures=1, featuresCol="features",
- outputCol="selectedFeatures", labelCol="label")
-
- result = selector.fit(df).transform(df)
-
- print("FValueSelector output with top %d features selected" % selector.getNumTopFeatures())
- result.show()
- # $example off$
-
- spark.stop()
diff --git a/examples/src/main/python/ml/fvalue_test_example.py b/examples/src/main/python/ml/fvalue_test_example.py
deleted file mode 100644
index dfa8073..0000000
--- a/examples/src/main/python/ml/fvalue_test_example.py
+++ /dev/null
@@ -1,50 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements. See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License. You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-An example for FValue testing.
-Run with:
- bin/spark-submit examples/src/main/python/ml/fvalue_test_example.py
-"""
-from pyspark.sql import SparkSession
-# $example on$
-from pyspark.ml.linalg import Vectors
-from pyspark.ml.stat import FValueTest
-# $example off$
-
-if __name__ == "__main__":
- spark = SparkSession \
- .builder \
- .appName("FValueTestExample") \
- .getOrCreate()
-
- # $example on$
- data = [(4.6, Vectors.dense(6.0, 7.0, 0.0, 7.0, 6.0, 0.0)),
- (6.6, Vectors.dense(0.0, 9.0, 6.0, 0.0, 5.0, 9.0)),
- (5.1, Vectors.dense(0.0, 9.0, 3.0, 0.0, 5.0, 5.0)),
- (7.6, Vectors.dense(0.0, 9.0, 8.0, 5.0, 6.0, 4.0)),
- (9.0, Vectors.dense(8.0, 9.0, 6.0, 5.0, 4.0, 4.0)),
- (9.0, Vectors.dense(8.0, 9.0, 6.0, 4.0, 0.0, 0.0))]
- df = spark.createDataFrame(data, ["label", "features"])
-
- ftest = FValueTest.test(df, "features", "label").head()
- print("pValues: " + str(ftest.pValues))
- print("degreesOfFreedom: " + str(ftest.degreesOfFreedom))
- print("fvalues: " + str(ftest.fValues))
- # $example off$
-
- spark.stop()
diff --git a/examples/src/main/python/ml/anova_selector_example.py b/examples/src/main/python/ml/univariate_feature_selector_example.py
similarity index 70%
rename from examples/src/main/python/ml/anova_selector_example.py
rename to examples/src/main/python/ml/univariate_feature_selector_example.py
index da80fa6..6dc293e 100644
--- a/examples/src/main/python/ml/anova_selector_example.py
+++ b/examples/src/main/python/ml/univariate_feature_selector_example.py
@@ -16,20 +16,20 @@
#
"""
-An example for ANOVASelector.
+An example for UnivariateFeatureSelector.
Run with:
- bin/spark-submit examples/src/main/python/ml/anova_selector_example.py
+ bin/spark-submit examples/src/main/python/ml/univariate_feature_selector_example.py
"""
from pyspark.sql import SparkSession
# $example on$
-from pyspark.ml.feature import ANOVASelector
+from pyspark.ml.feature import UnivariateFeatureSelector
from pyspark.ml.linalg import Vectors
# $example off$
if __name__ == "__main__":
spark = SparkSession\
.builder\
- .appName("ANOVASelectorExample")\
+ .appName("UnivariateFeatureSelectorExample")\
.getOrCreate()
# $example on$
@@ -41,12 +41,14 @@ if __name__ == "__main__":
(5, Vectors.dense([8.9, 5.2, 7.8, 8.3, 5.2, 3.0]), 4.0,),
(6, Vectors.dense([7.9, 8.5, 9.2, 4.0, 9.4, 2.1]), 4.0,)], ["id", "features", "label"])
- selector = ANOVASelector(numTopFeatures=1, featuresCol="features",
- outputCol="selectedFeatures", labelCol="label")
+ selector = UnivariateFeatureSelector(featuresCol="features", outputCol="selectedFeatures",
+ labelCol="label", selectionMode="numTopFeatures")
+ selector.setFeatureType("continuous").setLabelType("categorical").setSelectionThreshold(1)
result = selector.fit(df).transform(df)
- print("ANOVASelector output with top %d features selected" % selector.getNumTopFeatures())
+ print("UnivariateFeatureSelector output with top %d features selected using f_classif"
+ % selector.getSelectionThreshold())
result.show()
# $example off$
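
Note on the example above: it sets `featureType` to "continuous" and `labelType` to "categorical", which is why the printed message mentions f_classif. The class documentation added by this commit lists three supported combinations of the two types. The lookup can be sketched in plain Python as below. The function name `pick_score_function` and the `ValueError` for unlisted combinations are hypothetical; this illustrates the documented mapping and is not Spark's internal code.

```python
# Illustrative only: maps (featureType, labelType) to the score function name
# given in the UnivariateFeatureSelector documentation. Names are hypothetical.
SCORE_FUNCTIONS = {
    ("categorical", "categorical"): "chi2",
    ("continuous", "categorical"): "f_classif",
    ("continuous", "continuous"): "f_regression",
}


def pick_score_function(feature_type, label_type):
    """Return the documented score function name for a type combination."""
    try:
        return SCORE_FUNCTIONS[(feature_type, label_type)]
    except KeyError:
        # Assumption: unlisted combinations are rejected; the exact error
        # Spark raises is not shown in this part of the commit.
        raise ValueError(
            "Unsupported combination: featureType=%s, labelType=%s"
            % (feature_type, label_type))


if __name__ == "__main__":
    print(pick_score_function("continuous", "categorical"))
```

With the settings used in the example (continuous features, categorical label), the sketch returns "f_classif".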
diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala
deleted file mode 100644
index f0b9f23..0000000
--- a/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala
+++ /dev/null
@@ -1,63 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-// scalastyle:off println
-package org.apache.spark.examples.ml
-
-// $example on$
-import org.apache.spark.ml.linalg.{Vector, Vectors}
-import org.apache.spark.ml.stat.ANOVATest
-// $example off$
-import org.apache.spark.sql.SparkSession
-
-/**
- * An example for ANOVA testing.
- * Run with
- * {{{
- * bin/run-example ml.ANOVATestExample
- * }}}
- */
-object ANOVATestExample {
-
- def main(args: Array[String]): Unit = {
- val spark = SparkSession
- .builder
- .appName("ANOVATestExample")
- .getOrCreate()
- import spark.implicits._
-
- // $example on$
- val data = Seq(
- (3.0, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3)),
- (2.0, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1)),
- (3.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)),
- (2.0, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8)),
- (4.0, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0)),
- (4.0, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1))
- )
-
- val df = data.toDF("label", "features")
- val anova = ANOVATest.test(df, "features", "label").head
- println(s"pValues = ${anova.getAs[Vector](0)}")
- println(s"degreesOfFreedom ${anova.getSeq[Int](1).mkString("[", ",", "]")}")
- println(s"fValues ${anova.getAs[Vector](2)}")
- // $example off$
-
- spark.stop()
- }
-}
-// scalastyle:on println
diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/FValueSelectorExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/FValueSelectorExample.scala
deleted file mode 100644
index 914d81b..0000000
--- a/examples/src/main/scala/org/apache/spark/examples/ml/FValueSelectorExample.scala
+++ /dev/null
@@ -1,69 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-// scalastyle:off println
-package org.apache.spark.examples.ml
-
-// $example on$
-import org.apache.spark.ml.feature.FValueSelector
-import org.apache.spark.ml.linalg.Vectors
-// $example off$
-import org.apache.spark.sql.SparkSession
-
-/**
- * An example for FValueSelector.
- * Run with
- * {{{
- * bin/run-example ml.FValueSelectorExample
- * }}}
- */
-object FValueSelectorExample {
- def main(args: Array[String]): Unit = {
- val spark = SparkSession
- .builder
- .appName("FValueSelectorExample")
- .getOrCreate()
- import spark.implicits._
-
- // $example on$
- val data = Seq(
- (1, Vectors.dense(6.0, 7.0, 0.0, 7.0, 6.0, 0.0), 4.6),
- (2, Vectors.dense(0.0, 9.0, 6.0, 0.0, 5.0, 9.0), 6.6),
- (3, Vectors.dense(0.0, 9.0, 3.0, 0.0, 5.0, 5.0), 5.1),
- (4, Vectors.dense(0.0, 9.0, 8.0, 5.0, 6.0, 4.0), 7.6),
- (5, Vectors.dense(8.0, 9.0, 6.0, 5.0, 4.0, 4.0), 9.0),
- (6, Vectors.dense(8.0, 9.0, 6.0, 4.0, 0.0, 0.0), 9.0)
- )
-
- val df = spark.createDataset(data).toDF("id", "features", "label")
-
- val selector = new FValueSelector()
- .setNumTopFeatures(1)
- .setFeaturesCol("features")
- .setLabelCol("label")
- .setOutputCol("selectedFeatures")
-
- val result = selector.fit(df).transform(df)
-
- println(s"FValueSelector output with top ${selector.getNumTopFeatures} features selected")
- result.show()
- // $example off$
-
- spark.stop()
- }
-}
-// scalastyle:on println
diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/FValueTestExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/FValueTestExample.scala
deleted file mode 100644
index 08ec22c..0000000
--- a/examples/src/main/scala/org/apache/spark/examples/ml/FValueTestExample.scala
+++ /dev/null
@@ -1,63 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-// scalastyle:off println
-package org.apache.spark.examples.ml
-
-// $example on$
-import org.apache.spark.ml.linalg.{Vector, Vectors}
-import org.apache.spark.ml.stat.FValueTest
-// $example off$
-import org.apache.spark.sql.SparkSession
-
-/**
- * An example for FValue testing.
- * Run with
- * {{{
- * bin/run-example ml.FValueTestExample
- * }}}
- */
-object FValueTestExample {
-
- def main(args: Array[String]): Unit = {
- val spark = SparkSession
- .builder
- .appName("FValueTestExample")
- .getOrCreate()
- import spark.implicits._
-
- // $example on$
- val data = Seq(
- (4.6, Vectors.dense(6.0, 7.0, 0.0, 7.0, 6.0, 0.0)),
- (6.6, Vectors.dense(0.0, 9.0, 6.0, 0.0, 5.0, 9.0)),
- (5.1, Vectors.dense(0.0, 9.0, 3.0, 0.0, 5.0, 5.0)),
- (7.6, Vectors.dense(0.0, 9.0, 8.0, 5.0, 6.0, 4.0)),
- (9.0, Vectors.dense(8.0, 9.0, 6.0, 5.0, 4.0, 4.0)),
- (9.0, Vectors.dense(8.0, 9.0, 6.0, 4.0, 0.0, 0.0))
- )
-
- val df = data.toDF("label", "features")
- val fValue = FValueTest.test(df, "features", "label").head
- println(s"pValues ${fValue.getAs[Vector](0)}")
- println(s"degreesOfFreedom ${fValue.getSeq[Int](1).mkString("[", ",", "]")}")
- println(s"fValues ${fValue.getAs[Vector](2)}")
- // $example off$
-
- spark.stop()
- }
-}
-// scalastyle:on println
diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/ANOVASelectorExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/UnivariateFeatureSelectorExample.scala
similarity index 76%
rename from examples/src/main/scala/org/apache/spark/examples/ml/ANOVASelectorExample.scala
rename to examples/src/main/scala/org/apache/spark/examples/ml/UnivariateFeatureSelectorExample.scala
index 46803cc..e4932db 100644
--- a/examples/src/main/scala/org/apache/spark/examples/ml/ANOVASelectorExample.scala
+++ b/examples/src/main/scala/org/apache/spark/examples/ml/UnivariateFeatureSelectorExample.scala
@@ -19,23 +19,23 @@
package org.apache.spark.examples.ml
// $example on$
-import org.apache.spark.ml.feature.ANOVASelector
+import org.apache.spark.ml.feature.UnivariateFeatureSelector
import org.apache.spark.ml.linalg.Vectors
// $example off$
import org.apache.spark.sql.SparkSession
/**
- * An example for ANOVASelector.
+ * An example for UnivariateFeatureSelector.
* Run with
* {{{
- * bin/run-example ml.ANOVASelectorExample
+ * bin/run-example ml.UnivariateFeatureSelectorExample
* }}}
*/
-object ANOVASelectorExample {
+object UnivariateFeatureSelectorExample {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder
- .appName("ANOVASelectorExample")
+ .appName("UnivariateFeatureSelectorExample")
.getOrCreate()
import spark.implicits._
@@ -51,15 +51,19 @@ object ANOVASelectorExample {
val df = spark.createDataset(data).toDF("id", "features", "label")
- val selector = new ANOVASelector()
- .setNumTopFeatures(1)
+ val selector = new UnivariateFeatureSelector()
+ .setFeatureType("continuous")
+ .setLabelType("categorical")
+ .setSelectionMode("numTopFeatures")
+ .setSelectionThreshold(1)
.setFeaturesCol("features")
.setLabelCol("label")
.setOutputCol("selectedFeatures")
val result = selector.fit(df).transform(df)
- println(s"ANOVASelector output with top ${selector.getNumTopFeatures} features selected")
+ println(s"UnivariateFeatureSelector output with top ${selector.getSelectionThreshold}" +
+ s" features selected using f_classif")
result.show()
// $example off$
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/ANOVASelector.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/ANOVASelector.scala
deleted file mode 100644
index 81ffd01..0000000
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/ANOVASelector.scala
+++ /dev/null
@@ -1,195 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.ml.feature
-
-import org.apache.hadoop.fs.Path
-
-import org.apache.spark.annotation.Since
-import org.apache.spark.ml.param._
-import org.apache.spark.ml.stat.ANOVATest
-import org.apache.spark.ml.util._
-import org.apache.spark.sql.{DataFrame, Dataset}
-
-
-/**
- * ANOVA F-value Classification selector, which selects continuous features to use for predicting a
- * categorical label.
- * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
- * `fdr`, `fwe`.
- * - `numTopFeatures` chooses a fixed number of top features according to a F value classification
- * test.
- * - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
- * - `fpr` chooses all features whose p-value are below a threshold, thus controlling the false
- * positive rate of selection.
- * - `fdr` uses the [Benjamini-Hochberg procedure]
- * (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
- * to choose all features whose false discovery rate is below a threshold.
- * - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
- * 1/numFeatures, thus controlling the family-wise error rate of selection.
- * By default, the selection method is `numTopFeatures`, with the default number of top features
- * set to 50.
- */
-@Since("3.1.0")
-final class ANOVASelector @Since("3.1.0")(@Since("3.1.0") override val uid: String)
- extends Selector[ANOVASelectorModel] {
-
- @Since("3.1.0")
- def this() = this(Identifiable.randomUID("ANOVASelector"))
-
- /** @group setParam */
- @Since("3.1.0")
- override def setNumTopFeatures(value: Int): this.type = super.setNumTopFeatures(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setPercentile(value: Double): this.type = super.setPercentile(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setFpr(value: Double): this.type = super.setFpr(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setFdr(value: Double): this.type = super.setFdr(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setFwe(value: Double): this.type = super.setFwe(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setSelectorType(value: String): this.type = super.setSelectorType(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setFeaturesCol(value: String): this.type = super.setFeaturesCol(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setOutputCol(value: String): this.type = super.setOutputCol(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setLabelCol(value: String): this.type = super.setLabelCol(value)
-
- /**
- * get the SelectionTestResult for every feature against the label
- */
- protected[this] override def getSelectionTestResult(df: DataFrame): DataFrame = {
- ANOVATest.test(df, getFeaturesCol, getLabelCol, true)
- }
-
- /**
- * Create a new instance of concrete SelectorModel.
- * @param indices The indices of the selected features
- * @return A new SelectorModel instance
- */
- protected[this] def createSelectorModel(
- uid: String,
- indices: Array[Int]): ANOVASelectorModel = {
- new ANOVASelectorModel(uid, indices)
- }
-
- @Since("3.1.0")
- override def fit(dataset: Dataset[_]): ANOVASelectorModel = {
- super.fit(dataset)
- }
-
- @Since("3.1.0")
- override def copy(extra: ParamMap): this.type = defaultCopy(extra)
-}
-
-@Since("3.1.0")
-object ANOVASelector extends DefaultParamsReadable[ANOVASelector] {
-
- @Since("3.1.0")
- override def load(path: String): ANOVASelector = super.load(path)
-}
-
-/**
- * Model fitted by [[ANOVASelector]].
- */
-@Since("3.1.0")
-class ANOVASelectorModel private[ml](
- @Since("3.1.0") override val uid: String,
- @Since("3.1.0") override val selectedFeatures: Array[Int])
- extends SelectorModel[ANOVASelectorModel] (uid, selectedFeatures) {
-
- /** @group setParam */
- @Since("3.1.0")
- override def setFeaturesCol(value: String): this.type = super.setFeaturesCol(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setOutputCol(value: String): this.type = super.setOutputCol(value)
-
- @Since("3.1.0")
- override def copy(extra: ParamMap): ANOVASelectorModel = {
- val copied = new ANOVASelectorModel(uid, selectedFeatures)
- .setParent(parent)
- copyValues(copied, extra)
- }
-
- @Since("3.1.0")
- override def write: MLWriter = new ANOVASelectorModel.ANOVASelectorModelWriter(this)
-
- @Since("3.1.0")
- override def toString: String = {
- s"ANOVASelectorModel: uid=$uid, numSelectedFeatures=${selectedFeatures.length}"
- }
-}
-
-@Since("3.1.0")
-object ANOVASelectorModel extends MLReadable[ANOVASelectorModel] {
-
- @Since("3.1.0")
- override def read: MLReader[ANOVASelectorModel] = new ANOVASelectorModelReader
-
- @Since("3.1.0")
- override def load(path: String): ANOVASelectorModel = super.load(path)
-
- private[ANOVASelectorModel] class ANOVASelectorModelWriter(
- instance: ANOVASelectorModel) extends MLWriter {
-
- private case class Data(selectedFeatures: Seq[Int])
-
- override protected def saveImpl(path: String): Unit = {
- DefaultParamsWriter.saveMetadata(instance, path, sc)
- val data = Data(instance.selectedFeatures.toSeq)
- val dataPath = new Path(path, "data").toString
- sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
- }
- }
-
- private class ANOVASelectorModelReader extends MLReader[ANOVASelectorModel] {
-
- /** Checked against metadata when loading model */
- private val className = classOf[ANOVASelectorModel].getName
-
- override def load(path: String): ANOVASelectorModel = {
- val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
- val dataPath = new Path(path, "data").toString
- val data = sparkSession.read.parquet(dataPath)
- .select("selectedFeatures").head()
- val selectedFeatures = data.getAs[Seq[Int]](0).toArray
- val model = new ANOVASelectorModel(metadata.uid, selectedFeatures)
- metadata.getAndSetParams(model)
- model
- }
- }
-}
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
index 7f83b69..198a886 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
@@ -44,6 +44,7 @@ import org.apache.spark.sql.types.StructType
* By default, the selection method is `numTopFeatures`, with the default number of top features
* set to 50.
*/
+@deprecated("use UnivariateFeatureSelector instead", "3.1.0")
@Since("1.6.0")
final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: String)
extends Selector[ChiSqSelectorModel] {
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/FValueSelector.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/FValueSelector.scala
deleted file mode 100644
index d177555..0000000
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/FValueSelector.scala
+++ /dev/null
@@ -1,195 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.ml.feature
-
-import org.apache.hadoop.fs.Path
-
-import org.apache.spark.annotation.Since
-import org.apache.spark.ml.param.ParamMap
-import org.apache.spark.ml.stat.FValueTest
-import org.apache.spark.ml.util._
-import org.apache.spark.sql.{DataFrame, Dataset}
-
-
-/**
- * F Value Regression feature selector, which selects continuous features to use for predicting a
- * continuous label.
- * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
- * `fdr`, `fwe`.
- * - `numTopFeatures` chooses a fixed number of top features according to a F value regression
- * test.
- * - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
- * - `fpr` chooses all features whose p-value are below a threshold, thus controlling the false
- * positive rate of selection.
- * - `fdr` uses the [Benjamini-Hochberg procedure]
- * (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
- * to choose all features whose false discovery rate is below a threshold.
- * - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
- * 1/numFeatures, thus controlling the family-wise error rate of selection.
- * By default, the selection method is `numTopFeatures`, with the default number of top features
- * set to 50.
- */
-@Since("3.1.0")
-final class FValueSelector @Since("3.1.0") (@Since("3.1.0") override val uid: String) extends
- Selector[FValueSelectorModel] {
-
- @Since("3.1.0")
- def this() = this(Identifiable.randomUID("FValueSelector"))
-
- /** @group setParam */
- @Since("3.1.0")
- override def setNumTopFeatures(value: Int): this.type = super.setNumTopFeatures(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setPercentile(value: Double): this.type = super.setPercentile(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setFpr(value: Double): this.type = super.setFpr(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setFdr(value: Double): this.type = super.setFdr(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setFwe(value: Double): this.type = super.setFwe(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setSelectorType(value: String): this.type = super.setSelectorType(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setFeaturesCol(value: String): this.type = super.setFeaturesCol(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setOutputCol(value: String): this.type = super.setOutputCol(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setLabelCol(value: String): this.type = super.setLabelCol(value)
-
- /**
- * get the SelectionTestResult for every feature against the label
- */
- protected[this] override def getSelectionTestResult(df: DataFrame): DataFrame = {
- FValueTest.test(df, getFeaturesCol, getLabelCol, true)
- }
-
- /**
- * Create a new instance of concrete SelectorModel.
- * @param indices The indices of the selected features
- * @return A new SelectorModel instance
- */
- protected[this] def createSelectorModel(
- uid: String,
- indices: Array[Int]): FValueSelectorModel = {
- new FValueSelectorModel(uid, indices)
- }
-
- @Since("3.1.0")
- override def fit(dataset: Dataset[_]): FValueSelectorModel = {
- super.fit(dataset)
- }
-
- @Since("3.1.0")
- override def copy(extra: ParamMap): this.type = defaultCopy(extra)
-}
-
-@Since("3.1.0")
-object FValueSelector extends DefaultParamsReadable[FValueSelector] {
-
- @Since("3.1.0")
- override def load(path: String): FValueSelector = super.load(path)
-}
-
-/**
- * Model fitted by [[FValueSelector]]
- */
-@Since("3.1.0")
-class FValueSelectorModel private[ml](
- @Since("3.1.0") override val uid: String,
- @Since("3.1.0") override val selectedFeatures: Array[Int])
- extends SelectorModel[FValueSelectorModel] (uid, selectedFeatures) {
-
- /** @group setParam */
- @Since("3.1.0")
- override def setFeaturesCol(value: String): this.type = super.setFeaturesCol(value)
-
- /** @group setParam */
- @Since("3.1.0")
- override def setOutputCol(value: String): this.type = super.setOutputCol(value)
-
- @Since("3.1.0")
- override def copy(extra: ParamMap): FValueSelectorModel = {
- val copied = new FValueSelectorModel(uid, selectedFeatures)
- .setParent(parent)
- copyValues(copied, extra)
- }
-
- @Since("3.1.0")
- override def write: MLWriter = new FValueSelectorModel.FValueSelectorModelWriter(this)
-
- @Since("3.1.0")
- override def toString: String = {
- s"FValueSelectorModel: uid=$uid, numSelectedFeatures=${selectedFeatures.length}"
- }
-}
-
-@Since("3.1.0")
-object FValueSelectorModel extends MLReadable[FValueSelectorModel] {
-
- @Since("3.1.0")
- override def read: MLReader[FValueSelectorModel] = new FValueSelectorModelReader
-
- @Since("3.1.0")
- override def load(path: String): FValueSelectorModel = super.load(path)
-
- private[FValueSelectorModel] class FValueSelectorModelWriter(
- instance: FValueSelectorModel) extends MLWriter {
-
- private case class Data(selectedFeatures: Seq[Int])
-
- override protected def saveImpl(path: String): Unit = {
- DefaultParamsWriter.saveMetadata(instance, path, sc)
- val data = Data(instance.selectedFeatures.toSeq)
- val dataPath = new Path(path, "data").toString
- sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
- }
- }
-
- private class FValueSelectorModelReader extends MLReader[FValueSelectorModel] {
-
- /** Checked against metadata when loading model */
- private val className = classOf[FValueSelectorModel].getName
-
- override def load(path: String): FValueSelectorModel = {
- val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
- val dataPath = new Path(path, "data").toString
- val data = sparkSession.read.parquet(dataPath)
- .select("selectedFeatures").head()
- val selectedFeatures = data.getAs[Seq[Int]](0).toArray
- val model = new FValueSelectorModel(metadata.uid, selectedFeatures)
- metadata.getAndSetParams(model)
- model
- }
- }
-}
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Selector.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Selector.scala
index 41de26d..cb8b71a 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/Selector.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Selector.scala
@@ -133,10 +133,6 @@ private[feature] trait SelectorParams extends Params
* Super class for feature selectors.
* 1. Chi-Square Selector
* This feature selector is for categorical features and categorical labels.
- * 2. ANOVA F-value Classification Selector
- * This feature selector is for continuous features and categorical labels.
- * 3. Regression F-value Selector
- * This feature selector is for continuous features and continuous labels.
* The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
* `fdr`, `fwe`.
* - `numTopFeatures` chooses a fixed number of top features according to a hypothesis.
@@ -279,11 +275,6 @@ private[ml] abstract class SelectorModel[T <: SelectorModel[T]] (
extends Model[T] with SelectorParams with MLWritable {
self: T =>
- if (selectedFeatures.length >= 2) {
- require(selectedFeatures.sliding(2).forall(l => l(0) < l(1)),
- "Index should be strictly increasing.")
- }
-
/** @group setParam */
@Since("3.1.0")
def setFeaturesCol(value: String): this.type = set(featuresCol, value)
@@ -298,7 +289,8 @@ private[ml] abstract class SelectorModel[T <: SelectorModel[T]] (
override def transform(dataset: Dataset[_]): DataFrame = {
val outputSchema = transformSchema(dataset.schema, logging = true)
- SelectorModel.transform(dataset, selectedFeatures, outputSchema, $(outputCol), $(featuresCol))
+ SelectorModel.transform(dataset, selectedFeatures.sorted, outputSchema,
+ $(outputCol), $(featuresCol))
}
@Since("3.1.0")
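
Note on the selection methods referenced in the Selector documentation above (`numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`): the following plain-Python sketch shows one way each method could pick feature indices from a list of per-feature p-values. It is an illustration under stated assumptions, not the Spark implementation. The function name `select_features`, the single `threshold` argument, and the handling of ties are hypothetical. The result is returned in ascending index order, in the spirit of the `selectedFeatures.sorted` call in the change above.

```python
# Illustrative only: picks feature indices from p-values for each selection
# method named in the Selector documentation. All names are hypothetical.
def select_features(p_values, mode="numTopFeatures", threshold=1):
    n = len(p_values)
    # Rank feature indices by ascending p-value (most significant first).
    ranked = sorted(range(n), key=lambda i: p_values[i])

    if mode == "numTopFeatures":
        # Keep a fixed number of top-ranked features.
        chosen = ranked[:int(threshold)]
    elif mode == "percentile":
        # Keep a fraction of all features instead of a fixed number.
        chosen = ranked[:int(n * threshold)]
    elif mode == "fpr":
        # Keep every feature whose p-value is below the threshold.
        chosen = [i for i in ranked if p_values[i] < threshold]
    elif mode == "fdr":
        # Benjamini-Hochberg: find the largest rank k such that
        # p_(k) <= threshold * k / n, then keep the k top-ranked features.
        k_max = 0
        for k, i in enumerate(ranked, start=1):
            if p_values[i] <= threshold * k / n:
                k_max = k
        chosen = ranked[:k_max]
    elif mode == "fwe":
        # Threshold scaled by 1/numFeatures (family-wise error control).
        chosen = [i for i in ranked if p_values[i] < threshold / n]
    else:
        raise ValueError("Unknown selection mode: %s" % mode)

    # Return indices in ascending order, as the model applies them sorted.
    return sorted(chosen)


if __name__ == "__main__":
    p = [0.20, 0.001, 0.03, 0.04, 0.5]
    for mode, t in [("numTopFeatures", 2), ("percentile", 0.5),
                    ("fpr", 0.05), ("fdr", 0.05), ("fwe", 0.05)]:
        print(mode, select_features(p, mode, t))
```

Returning sorted indices keeps the output vector in the same relative feature order as the input, whatever order the ranking produced.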
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/UnivariateFeatureSelector.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/UnivariateFeatureSelector.scala
new file mode 100644
index 0000000..6d5f09e
--- /dev/null
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/UnivariateFeatureSelector.scala
@@ -0,0 +1,467 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.collection.mutable.ArrayBuilder
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NominalAttribute, NumericAttribute}
+import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors, VectorUDT}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol, HasOutputCol}
+import org.apache.spark.ml.stat.{ANOVATest, ChiSquareTest, FValueTest}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{StructField, StructType}
+
+
+/**
+ * Params for [[UnivariateFeatureSelector]] and [[UnivariateFeatureSelectorModel]].
+ */
+private[feature] trait UnivariateFeatureSelectorParams extends Params
+ with HasFeaturesCol with HasLabelCol with HasOutputCol {
+
+ /**
+ * The feature type.
+ * Supported options: "categorical", "continuous"
+ * @group param
+ */
+ @Since("3.1.1")
+ final val featureType = new Param[String](this, "featureType",
+ "Feature type. Supported options: categorical, continuous.",
+ ParamValidators.inArray(Array("categorical", "continuous")))
+
+ /** @group getParam */
+ @Since("3.1.1")
+ def getFeatureType: String = $(featureType)
+
+ /**
+ * The label type.
+ * Supported options: "categorical", "continuous"
+ * @group param
+ */
+ @Since("3.1.1")
+ final val labelType = new Param[String](this, "labelType",
+ "Label type. Supported options: categorical, continuous.",
+ ParamValidators.inArray(Array("categorical", "continuous")))
+
+ /** @group getParam */
+ @Since("3.1.1")
+ def getLabelType: String = $(labelType)
+
+ /**
+ * The selection mode.
+ * Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe"
+ * @group param
+ */
+ @Since("3.1.1")
+ final val selectionMode = new Param[String](this, "selectionMode",
+ "The selection mode. Supported options: numTopFeatures, percentile, fpr, fdr, fwe",
+ ParamValidators.inArray(Array("numTopFeatures", "percentile", "fpr", "fdr",
+ "fwe")))
+
+ /** @group getParam */
+ @Since("3.1.1")
+ def getSelectionMode: String = $(selectionMode)
+
+ /**
+ * The upper bound of the features that the selector will select.
+ * @group param
+ */
+ @Since("3.1.1")
+ final val selectionThreshold = new DoubleParam(this, "selectionThreshold",
+ "The upper bound of the features that selector will select.")
+
+ /** @group getParam */
+ def getSelectionThreshold: Double = $(selectionThreshold)
+
+ setDefault(selectionMode -> "numTopFeatures")
+}
+
+/**
+ * The user can set `featureType` and `labelType`, and Spark will pick the score function based on
+ * the specified `featureType` and `labelType`.
+ * The following combinations of `featureType` and `labelType` are supported:
+ * - `featureType` `categorical` and `labelType` `categorical`: Spark uses chi2.
+ * - `featureType` `continuous` and `labelType` `categorical`: Spark uses f_classif.
+ * - `featureType` `continuous` and `labelType` `continuous`: Spark uses f_regression.
+ *
+ * The `UnivariateFeatureSelector` supports different selection modes: `numTopFeatures`,
+ * `percentile`, `fpr`, `fdr`, `fwe`.
+ * - `numTopFeatures` chooses a fixed number of top features according to a hypothesis test.
+ * - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
+ * - `fpr` chooses all features whose p-values are below a threshold, thus controlling the false
+ * positive rate of selection.
+ * - `fdr` uses the <a href=
+ * "https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure">
+ * Benjamini-Hochberg procedure</a>
+ * to choose all features whose false discovery rate is below a threshold.
+ * - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+ * 1/numFeatures, thus controlling the family-wise error rate of selection.
+ *
+ * By default, the selection mode is `numTopFeatures`.
+ */
+@Since("3.1.1")
+final class UnivariateFeatureSelector @Since("3.1.1")(@Since("3.1.1") override val uid: String)
+ extends Estimator[UnivariateFeatureSelectorModel] with UnivariateFeatureSelectorParams
+ with DefaultParamsWritable {
+
+ @Since("3.1.1")
+ def this() = this(Identifiable.randomUID("UnivariateFeatureSelector"))
+
+ /** @group setParam */
+ @Since("3.1.1")
+ def setSelectionMode(value: String): this.type = set(selectionMode, value)
+
+ /** @group setParam */
+ @Since("3.1.1")
+ def setSelectionThreshold(value: Double): this.type = set(selectionThreshold, value)
+
+ /** @group setParam */
+ @Since("3.1.1")
+ def setFeatureType(value: String): this.type = set(featureType, value)
+
+ /** @group setParam */
+ @Since("3.1.1")
+ def setLabelType(value: String): this.type = set(labelType, value)
+
+ /** @group setParam */
+ @Since("3.1.1")
+ def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+ /** @group setParam */
+ @Since("3.1.1")
+ def setOutputCol(value: String): this.type = set(outputCol, value)
+
+ /** @group setParam */
+ @Since("3.1.1")
+ def setLabelCol(value: String): this.type = set(labelCol, value)
+
+ @Since("3.1.1")
+ override def fit(dataset: Dataset[_]): UnivariateFeatureSelectorModel = {
+ transformSchema(dataset.schema, logging = true)
+ val numFeatures = MetadataUtils.getNumFeatures(dataset, $(featuresCol))
+
+ $(selectionMode) match {
+ case ("numTopFeatures") =>
+ if (!isSet(selectionThreshold)) {
+ set(selectionThreshold, 50.0)
+ } else {
+ require($(selectionThreshold) > 0 && $(selectionThreshold).toInt == $(selectionThreshold),
+ "selectionThreshold needs to be a positive Integer for selection mode numTopFeatures")
+ }
+ case ("percentile") =>
+ if (!isSet(selectionThreshold)) {
+ set(selectionThreshold, 0.1)
+ } else {
+ require($(selectionThreshold) >= 0 && $(selectionThreshold) <= 1,
+ "selectionThreshold needs to be in the range of 0 to 1 for selection mode percentile")
+ }
+ case ("fpr") =>
+ if (!isSet(selectionThreshold)) {
+ set(selectionThreshold, 0.05)
+ } else {
+ require($(selectionThreshold) >= 0 && $(selectionThreshold) <= 1,
+ "selectionThreshold needs to be in the range of 0 to 1 for selection mode fpr")
+ }
+ case ("fdr") =>
+ if (!isSet(selectionThreshold)) {
+ set(selectionThreshold, 0.05)
+ } else {
+ require($(selectionThreshold) >= 0 && $(selectionThreshold) <= 1,
+ "selectionThreshold needs to be in the range of 0 to 1 for selection mode fdr")
+ }
+ case ("fwe") =>
+ if (!isSet(selectionThreshold)) {
+ set(selectionThreshold, 0.05)
+ } else {
+ require($(selectionThreshold) >= 0 && $(selectionThreshold) <= 1,
+ "selectionThreshold needs to be in the range of 0 to 1 for selection mode fwe")
+ }
+ case _ =>
+ throw new IllegalArgumentException(s"Unsupported selection mode:" +
+ s" selectionMode=${$(selectionMode)}")
+ }
+
+ require(isSet(featureType) && isSet(labelType), "featureType and labelType need to be set")
+ val resultDF = ($(featureType), $(labelType)) match {
+ case ("categorical", "categorical") =>
+ ChiSquareTest.test(dataset.toDF, getFeaturesCol, getLabelCol, true)
+ case ("continuous", "categorical") =>
+ ANOVATest.test(dataset.toDF, getFeaturesCol, getLabelCol, true)
+ case ("continuous", "continuous") =>
+ FValueTest.test(dataset.toDF, getFeaturesCol, getLabelCol, true)
+ case _ =>
+ throw new IllegalArgumentException(s"Unsupported combination:" +
+ s" featureType=${$(featureType)}, labelType=${$(labelType)}")
+ }
+
+ val indices =
+ selectIndicesFromPValues(numFeatures, resultDF, $(selectionMode), $(selectionThreshold))
+
+ copyValues(new UnivariateFeatureSelectorModel(uid, indices)
+ .setParent(this))
+ }
+
+ def getTopIndices(df: DataFrame, k: Int): Array[Int] = {
+ val spark = SparkSession.builder().getOrCreate()
+ import spark.implicits._
+ df.sort("pValue", "featureIndex")
+ .select("featureIndex")
+ .limit(k)
+ .as[Int]
+ .collect()
+ }
+
+ def selectIndicesFromPValues(
+ numFeatures: Int,
+ resultDF: DataFrame,
+ selectionMode: String,
+ selectionThreshold: Double): Array[Int] = {
+ val spark = SparkSession.builder().getOrCreate()
+ import spark.implicits._
+ val indices = selectionMode match {
+ case "numTopFeatures" =>
+ getTopIndices(resultDF, selectionThreshold.toInt)
+ case "percentile" =>
+ getTopIndices(resultDF, (numFeatures * selectionThreshold).toInt)
+ case "fpr" =>
+ resultDF.select("featureIndex")
+ .where(col("pValue") < selectionThreshold)
+ .as[Int].collect()
+ case "fdr" =>
+ // This uses the Benjamini-Hochberg procedure.
+ // https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure
+ val f = selectionThreshold / numFeatures
+ val maxIndex = resultDF.sort("pValue", "featureIndex")
+ .select("pValue")
+ .as[Double].rdd
+ .zipWithIndex
+ .flatMap { case (pValue, index) =>
+ if (pValue <= f * (index + 1)) {
+ Iterator.single(index.toInt)
+ } else Iterator.empty
+ }.fold(-1)(math.max)
+ if (maxIndex >= 0) {
+ getTopIndices(resultDF, maxIndex + 1)
+ } else Array.emptyIntArray
+ case "fwe" =>
+ resultDF.select("featureIndex")
+ .where(col("pValue") < selectionThreshold / numFeatures)
+ .as[Int].collect()
+ case errorType =>
+ throw new IllegalArgumentException(s"Unknown Selector Type: $errorType")
+ }
+ indices
+ }
+
+ @Since("3.1.1")
+ override def transformSchema(schema: StructType): StructType = {
+ SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+ SchemaUtils.checkNumericType(schema, $(labelCol))
+ SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
+ }
+
+ @Since("3.1.1")
+ override def copy(extra: ParamMap): UnivariateFeatureSelector = defaultCopy(extra)
+}
+
+@Since("3.1.1")
+object UnivariateFeatureSelector extends DefaultParamsReadable[UnivariateFeatureSelector] {
+
+ @Since("3.1.1")
+ override def load(path: String): UnivariateFeatureSelector = super.load(path)
+}
+
+/**
+ * Model fitted by [[UnivariateFeatureSelector]].
+ */
+@Since("3.1.1")
+class UnivariateFeatureSelectorModel private[ml](
+ @Since("3.1.1") override val uid: String,
+ @Since("3.1.1") val selectedFeatures: Array[Int])
+ extends Model[UnivariateFeatureSelectorModel] with UnivariateFeatureSelectorParams
+ with MLWritable {
+
+ /** @group setParam */
+ @Since("3.1.1")
+ def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+ /** @group setParam */
+ @Since("3.1.1")
+ def setOutputCol(value: String): this.type = set(outputCol, value)
+
+ protected def isNumericAttribute = true
+
+ @Since("3.1.1")
+ override def transform(dataset: Dataset[_]): DataFrame = {
+ val outputSchema = transformSchema(dataset.schema, logging = true)
+
+ UnivariateFeatureSelectorModel
+ .transform(dataset, selectedFeatures.sorted, outputSchema, $(outputCol), $(featuresCol))
+ }
+
+ @Since("3.1.1")
+ override def transformSchema(schema: StructType): StructType = {
+ SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+ val newField =
+ UnivariateFeatureSelectorModel
+ .prepOutputField(schema, selectedFeatures, $(outputCol), $(featuresCol), isNumericAttribute)
+ SchemaUtils.appendColumn(schema, newField)
+ }
+
+ @Since("3.1.1")
+ override def copy(extra: ParamMap): UnivariateFeatureSelectorModel = {
+ val copied = new UnivariateFeatureSelectorModel(uid, selectedFeatures)
+ .setParent(parent)
+ copyValues(copied, extra)
+ }
+
+ @Since("3.1.1")
+ override def write: MLWriter =
+ new UnivariateFeatureSelectorModel.UnivariateFeatureSelectorModelWriter(this)
+
+ @Since("3.1.1")
+ override def toString: String = {
+ s"UnivariateFeatureSelectorModel: uid=$uid, numSelectedFeatures=${selectedFeatures.length}"
+ }
+}
+
+@Since("3.1.1")
+object UnivariateFeatureSelectorModel extends MLReadable[UnivariateFeatureSelectorModel] {
+
+ @Since("3.1.1")
+ override def read: MLReader[UnivariateFeatureSelectorModel] =
+ new UnivariateFeatureSelectorModelReader
+
+ @Since("3.1.1")
+ override def load(path: String): UnivariateFeatureSelectorModel = super.load(path)
+
+ private[UnivariateFeatureSelectorModel] class UnivariateFeatureSelectorModelWriter(
+ instance: UnivariateFeatureSelectorModel) extends MLWriter {
+
+ private case class Data(selectedFeatures: Seq[Int])
+
+ override protected def saveImpl(path: String): Unit = {
+ DefaultParamsWriter.saveMetadata(instance, path, sc)
+ val data = Data(instance.selectedFeatures.toSeq)
+ val dataPath = new Path(path, "data").toString
+ sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
+ }
+ }
+
+ private class UnivariateFeatureSelectorModelReader
+ extends MLReader[UnivariateFeatureSelectorModel] {
+
+ /** Checked against metadata when loading model */
+ private val className = classOf[UnivariateFeatureSelectorModel].getName
+
+ override def load(path: String): UnivariateFeatureSelectorModel = {
+ val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+ val dataPath = new Path(path, "data").toString
+ val data = sparkSession.read.parquet(dataPath)
+ .select("selectedFeatures").head()
+ val selectedFeatures = data.getAs[Seq[Int]](0).toArray
+ val model = new UnivariateFeatureSelectorModel(metadata.uid, selectedFeatures)
+ metadata.getAndSetParams(model)
+ model
+ }
+ }
+
+ private def transform(
+ dataset: Dataset[_],
+ selectedFeatures: Array[Int],
+ outputSchema: StructType,
+ outputCol: String,
+ featuresCol: String): DataFrame = {
+ val newSize = selectedFeatures.length
+ val func = { vector: Vector =>
+ vector match {
+ case SparseVector(_, indices, values) =>
+ val (newIndices, newValues) =
+ compressSparse(indices, values, selectedFeatures)
+ Vectors.sparse(newSize, newIndices, newValues)
+ case DenseVector(values) =>
+ Vectors.dense(selectedFeatures.map(values))
+ case other =>
+ throw new UnsupportedOperationException(
+ s"Only sparse and dense vectors are supported but got ${other.getClass}.")
+ }
+ }
+
+ val transformer = udf(func)
+ dataset.withColumn(outputCol, transformer(col(featuresCol)),
+ outputSchema(outputCol).metadata)
+ }
+
+ /**
+ * Prepare the output column field, including per-feature metadata.
+ */
+ private def prepOutputField(
+ schema: StructType,
+ selectedFeatures: Array[Int],
+ outputCol: String,
+ featuresCol: String,
+ isNumericAttribute: Boolean): StructField = {
+ val selector = selectedFeatures.toSet
+ val origAttrGroup = AttributeGroup.fromStructField(schema(featuresCol))
+ val featureAttributes: Array[Attribute] = if (origAttrGroup.attributes.nonEmpty) {
+ origAttrGroup.attributes.get.zipWithIndex.filter(x => selector.contains(x._2)).map(_._1)
+ } else {
+ if (isNumericAttribute) {
+ Array.fill[Attribute](selector.size)(NumericAttribute.defaultAttr)
+ } else {
+ Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
+ }
+ }
+ val newAttributeGroup = new AttributeGroup(outputCol, featureAttributes)
+ newAttributeGroup.toStructField()
+ }
+
+ private def compressSparse(
+ indices: Array[Int],
+ values: Array[Double],
+ selectedFeatures: Array[Int]): (Array[Int], Array[Double]) = {
+ val newValues = new ArrayBuilder.ofDouble
+ val newIndices = new ArrayBuilder.ofInt
+ var i = 0
+ var j = 0
+ while (i < indices.length && j < selectedFeatures.length) {
+ val indicesIdx = indices(i)
+ val filterIndicesIdx = selectedFeatures(j)
+ if (indicesIdx == filterIndicesIdx) {
+ newIndices += j
+ newValues += values(i)
+ j += 1
+ i += 1
+ } else {
+ if (indicesIdx > filterIndicesIdx) {
+ j += 1
+ } else {
+ i += 1
+ }
+ }
+ }
+ // TODO: Sparse representation might be ineffective if (newSize ~= newValues.size)
+ (newIndices.result(), newValues.result())
+ }
+}
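The sparse branch of the model's transform relies on `compressSparse`, which walks the vector's active indices and the selected feature indices in lockstep. The following is a minimal standalone sketch of that two-pointer walk with no Spark dependency. The object name `CompressSparseSketch` and the sample vector are illustrative assumptions and are not part of this commit. Both index arrays are assumed to be sorted in ascending order, which is why `transform` passes `selectedFeatures.sorted`.

```scala
import scala.collection.mutable.ArrayBuilder

// Hypothetical standalone object; mirrors the logic of compressSparse in the diff above.
object CompressSparseSketch {

  // Keep only the entries of a sparse vector whose index appears in `selected`.
  // Both `indices` and `selected` must be sorted in ascending order.
  def compressSparse(
      indices: Array[Int],
      values: Array[Double],
      selected: Array[Int]): (Array[Int], Array[Double]) = {
    val newIndices = new ArrayBuilder.ofInt
    val newValues = new ArrayBuilder.ofDouble
    var i = 0 // position in the vector's active indices
    var j = 0 // position in the selected feature list
    while (i < indices.length && j < selected.length) {
      if (indices(i) == selected(j)) {
        // A selected feature is active: its new index is its rank in `selected`.
        newIndices += j
        newValues += values(i)
        i += 1
        j += 1
      } else if (indices(i) > selected(j)) {
        j += 1 // selected feature is zero in this vector, skip it
      } else {
        i += 1 // active feature was not selected, drop it
      }
    }
    (newIndices.result(), newValues.result())
  }

  def main(args: Array[String]): Unit = {
    // Active entries at indices 1, 2, 4, 5; keep original features 2 and 5.
    val (idx, vals) =
      compressSparse(Array(1, 2, 4, 5), Array(9.0, 6.0, 5.0, 9.0), Array(2, 5))
    println(idx.mkString(",") + " / " + vals.mkString(","))
  }
}
```

The output index is `j`, the position of the feature within the selected list, so the compressed vector has size `selected.length` regardless of how many entries are active.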
diff --git a/mllib/src/main/scala/org/apache/spark/ml/stat/ANOVATest.scala b/mllib/src/main/scala/org/apache/spark/ml/stat/ANOVATest.scala
index f14f63b..7a7e76c 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/stat/ANOVATest.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/stat/ANOVATest.scala
@@ -35,7 +35,7 @@ import org.apache.spark.util.collection.OpenHashMap
* information on ANOVA test.
*/
@Since("3.1.0")
-object ANOVATest {
+private[ml] object ANOVATest {
/**
* @param dataset DataFrame of categorical labels and continuous features.
diff --git a/mllib/src/main/scala/org/apache/spark/ml/stat/FValueTest.scala b/mllib/src/main/scala/org/apache/spark/ml/stat/FValueTest.scala
index ad506ab..f315e92 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/stat/FValueTest.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/stat/FValueTest.scala
@@ -30,7 +30,7 @@ import org.apache.spark.sql.functions._
* FValue test for continuous data.
*/
@Since("3.1.0")
-object FValueTest {
+private[ml] object FValueTest {
/** Used to construct output schema of tests */
private case class FValueResult(
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/ANOVASelectorSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/ANOVASelectorSuite.scala
deleted file mode 100644
index 0d664e4..0000000
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/ANOVASelectorSuite.scala
+++ /dev/null
@@ -1,206 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.ml.feature
-
-import org.apache.spark.ml.linalg.{Vector, Vectors}
-import org.apache.spark.ml.param.ParamsSuite
-import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest, MLTestingUtils}
-import org.apache.spark.ml.util.TestingUtils._
-import org.apache.spark.sql.{Dataset, Row}
-
-class ANOVASelectorSuite extends MLTest with DefaultReadWriteTest {
-
- import testImplicits._
-
- @transient var dataset: Dataset[_] = _
-
- override def beforeAll(): Unit = {
- super.beforeAll()
-
- // scalastyle:off
- /*
- X:
- array([[4.65415496e-03, 1.03550567e-01, -1.17358140e+00,
- 1.61408773e-01, 3.92492111e-01, 7.31240882e-01],
- [-9.01651741e-01, -5.28905302e-01, 1.27636785e+00,
- 7.02154563e-01, 6.21348351e-01, 1.88397353e-01],
- [ 3.85692159e-01, -9.04639637e-01, 5.09782604e-02,
- 8.40043971e-01, 7.45977857e-01, 8.78402288e-01],
- [ 1.36264353e+00, 2.62454094e-01, 7.96306202e-01,
- 6.14948000e-01, 7.44948187e-01, 9.74034830e-01],
- [ 9.65874070e-01, 2.52773665e+00, -2.19380094e+00,
- 2.33408080e-01, 1.86340919e-01, 8.23390433e-01],
- [ 1.12324305e+01, -2.77121515e-01, 1.12740513e-01,
- 2.35184013e-01, 3.46668895e-01, 9.38500782e-02],
- [ 1.06195839e+01, -1.82891238e+00, 2.25085601e-01,
- 9.09979851e-01, 6.80257535e-02, 8.24017480e-01],
- [ 1.12806837e+01, 1.30686889e+00, 9.32839108e-02,
- 3.49784755e-01, 1.71322408e-02, 7.48465194e-02],
- [ 9.98689462e+00, 9.50808938e-01, -2.90786359e-01,
- 2.31253009e-01, 7.46270968e-01, 1.60308169e-01],
- [ 1.08428551e+01, -1.02749936e+00, 1.73951508e-01,
- 8.92482744e-02, 1.42651730e-01, 7.66751625e-01],
- [-1.98641448e+00, 1.12811990e+01, -2.35246756e-01,
- 8.22809049e-01, 3.26739456e-01, 7.88268404e-01],
- [-6.09864090e-01, 1.07346276e+01, -2.18805509e-01,
- 7.33931213e-01, 1.42554396e-01, 7.11225605e-01],
- [-1.58481268e+00, 9.19364039e+00, -5.87490459e-02,
- 2.51532056e-01, 2.82729807e-01, 7.16245686e-01],
- [-2.50949277e-01, 1.12815254e+01, -6.94806734e-01,
- 5.93898886e-01, 5.68425656e-01, 8.49762330e-01],
- [ 7.63485129e-01, 1.02605138e+01, 1.32617719e+00,
- 5.49682879e-01, 8.59931442e-01, 4.88677978e-02],
- [ 9.34900015e-01, 4.11379043e-01, 8.65010205e+00,
- 9.23509168e-01, 1.16995043e-01, 5.91894106e-03],
- [ 4.73734933e-01, -1.48321181e+00, 9.73349621e+00,
- 4.09421563e-01, 5.09375719e-01, 5.93157850e-01],
- [ 3.41470679e-01, -6.88972582e-01, 9.60347938e+00,
- 3.62654055e-01, 2.43437468e-01, 7.13052838e-01],
- [-5.29614251e-01, -1.39262856e+00, 1.01354144e+01,
- 8.24123861e-01, 5.84074506e-01, 6.54461558e-01],
- [-2.99454508e-01, 2.20457263e+00, 1.14586015e+01,
- 5.16336729e-01, 9.99776159e-01, 3.15769738e-01]])
- y:
- array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4])
- scikit-learn result:
- >>> f_classif(X, y)
- (array([228.27701422, 84.33070501, 134.25330675, 0.82211775, 0.82991363, 1.08478943]),
- array([2.43864448e-13, 5.09088367e-10, 1.49033067e-11, 5.00596446e-01, 4.96684374e-01, 3.83798191e-01]))
- */
- // scalastyle:on
-
- val data = Seq(
- (1, Vectors.dense(4.65415496e-03, 1.03550567e-01, -1.17358140e+00,
- 1.61408773e-01, 3.92492111e-01, 7.31240882e-01), Vectors.dense(4.65415496e-03)),
- (1, Vectors.dense(-9.01651741e-01, -5.28905302e-01, 1.27636785e+00,
- 7.02154563e-01, 6.21348351e-01, 1.88397353e-01), Vectors.dense(-9.01651741e-01)),
- (1, Vectors.dense(3.85692159e-01, -9.04639637e-01, 5.09782604e-02,
- 8.40043971e-01, 7.45977857e-01, 8.78402288e-01), Vectors.dense(3.85692159e-01)),
- (1, Vectors.dense(1.36264353e+00, 2.62454094e-01, 7.96306202e-01,
- 6.14948000e-01, 7.44948187e-01, 9.74034830e-01), Vectors.dense(1.36264353e+00)),
- (1, Vectors.dense(9.65874070e-01, 2.52773665e+00, -2.19380094e+00,
- 2.33408080e-01, 1.86340919e-01, 8.23390433e-01), Vectors.dense(9.65874070e-01)),
- (2, Vectors.dense(1.12324305e+01, -2.77121515e-01, 1.12740513e-01,
- 2.35184013e-01, 3.46668895e-01, 9.38500782e-02), Vectors.dense(1.12324305e+01)),
- (2, Vectors.dense(1.06195839e+01, -1.82891238e+00, 2.25085601e-01,
- 9.09979851e-01, 6.80257535e-02, 8.24017480e-01), Vectors.dense(1.06195839e+01)),
- (2, Vectors.dense(1.12806837e+01, 1.30686889e+00, 9.32839108e-02,
- 3.49784755e-01, 1.71322408e-02, 7.48465194e-02), Vectors.dense(1.12806837e+01)),
- (2, Vectors.dense(9.98689462e+00, 9.50808938e-01, -2.90786359e-01,
- 2.31253009e-01, 7.46270968e-01, 1.60308169e-01), Vectors.dense(9.98689462e+00)),
- (2, Vectors.dense(1.08428551e+01, -1.02749936e+00, 1.73951508e-01,
- 8.92482744e-02, 1.42651730e-01, 7.66751625e-01), Vectors.dense(1.08428551e+01)),
- (3, Vectors.dense(-1.98641448e+00, 1.12811990e+01, -2.35246756e-01,
- 8.22809049e-01, 3.26739456e-01, 7.88268404e-01), Vectors.dense(-1.98641448e+00)),
- (3, Vectors.dense(-6.09864090e-01, 1.07346276e+01, -2.18805509e-01,
- 7.33931213e-01, 1.42554396e-01, 7.11225605e-01), Vectors.dense(-6.09864090e-01)),
- (3, Vectors.dense(-1.58481268e+00, 9.19364039e+00, -5.87490459e-02,
- 2.51532056e-01, 2.82729807e-01, 7.16245686e-01), Vectors.dense(-1.58481268e+00)),
- (3, Vectors.dense(-2.50949277e-01, 1.12815254e+01, -6.94806734e-01,
- 5.93898886e-01, 5.68425656e-01, 8.49762330e-01), Vectors.dense(-2.50949277e-01)),
- (3, Vectors.dense(7.63485129e-01, 1.02605138e+01, 1.32617719e+00,
- 5.49682879e-01, 8.59931442e-01, 4.88677978e-02), Vectors.dense(7.63485129e-01)),
- (4, Vectors.dense(9.34900015e-01, 4.11379043e-01, 8.65010205e+00,
- 9.23509168e-01, 1.16995043e-01, 5.91894106e-03), Vectors.dense(9.34900015e-01)),
- (4, Vectors.dense(4.73734933e-01, -1.48321181e+00, 9.73349621e+00,
- 4.09421563e-01, 5.09375719e-01, 5.93157850e-01), Vectors.dense(4.73734933e-01)),
- (4, Vectors.dense(3.41470679e-01, -6.88972582e-01, 9.60347938e+00,
- 3.62654055e-01, 2.43437468e-01, 7.13052838e-01), Vectors.dense(3.41470679e-01)),
- (4, Vectors.dense(-5.29614251e-01, -1.39262856e+00, 1.01354144e+01,
- 8.24123861e-01, 5.84074506e-01, 6.54461558e-01), Vectors.dense(-5.29614251e-01)),
- (4, Vectors.dense(-2.99454508e-01, 2.20457263e+00, 1.14586015e+01,
- 5.16336729e-01, 9.99776159e-01, 3.15769738e-01), Vectors.dense(-2.99454508e-01)))
-
- dataset = spark.createDataFrame(data).toDF("label", "features", "topFeature")
- }
-
- test("params") {
- ParamsSuite.checkParams(new ANOVASelector())
- }
-
- test("Test ANOVAFValue classification selector: numTopFeatures") {
- val selector = new ANOVASelector()
- .setOutputCol("filtered").setSelectorType("numTopFeatures").setNumTopFeatures(1)
- val model = testSelector(selector, dataset)
- MLTestingUtils.checkCopyAndUids(selector, model)
- }
-
- test("Test ANOVAFValue classification selector: percentile") {
- val selector = new ANOVASelector()
- .setOutputCol("filtered").setSelectorType("percentile").setPercentile(0.17)
- val model = testSelector(selector, dataset)
- MLTestingUtils.checkCopyAndUids(selector, model)
- }
-
- test("Test ANOVAFValue classification selector: fpr") {
- val selector = new ANOVASelector()
- .setOutputCol("filtered").setSelectorType("fpr").setFpr(1.0E-12)
- val model = testSelector(selector, dataset)
- MLTestingUtils.checkCopyAndUids(selector, model)
- }
-
- test("Test ANOVAFValue classification selector: fdr") {
- val selector = new ANOVASelector()
- .setOutputCol("filtered").setSelectorType("fdr").setFdr(6.0E-12)
- val model = testSelector(selector, dataset)
- MLTestingUtils.checkCopyAndUids(selector, model)
- }
-
- test("Test ANOVAFValue classification selector: fwe") {
- val selector = new ANOVASelector()
- .setOutputCol("filtered").setSelectorType("fwe").setFwe(6.0E-12)
- val model = testSelector(selector, dataset)
- MLTestingUtils.checkCopyAndUids(selector, model)
- }
-
- test("read/write") {
- def checkModelData(model: ANOVASelectorModel, model2: ANOVASelectorModel): Unit = {
- assert(model.selectedFeatures === model2.selectedFeatures)
- }
- val anovaSelector = new ANOVASelector()
- testEstimatorAndModelReadWrite(anovaSelector, dataset,
- ANOVASelectorSuite.allParamSettings,
- ANOVASelectorSuite.allParamSettings, checkModelData)
- }
-
- private def testSelector(selector: ANOVASelector, data: Dataset[_]):
- ANOVASelectorModel = {
- val selectorModel = selector.fit(data)
- testTransformer[(Double, Vector, Vector)](data.toDF(), selectorModel,
- "filtered", "topFeature") {
- case Row(vec1: Vector, vec2: Vector) =>
- assert(vec1 ~== vec2 absTol 1e-1)
- }
- selectorModel
- }
-}
-
-object ANOVASelectorSuite {
-
- /**
- * Mapping from all Params to valid settings which differ from the defaults.
- * This is useful for tests which need to exercise all Params, such as save/load.
- * This excludes input columns to simplify some tests.
- */
- val allParamSettings: Map[String, Any] = Map(
- "selectorType" -> "percentile",
- "numTopFeatures" -> 1,
- "percentile" -> 0.12,
- "outputCol" -> "myOutput"
- )
-}
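As a cross-check of the thresholds used in the removed suite above, the sketch below applies the `fpr`, `fdr`, and `fwe` rules from `selectIndicesFromPValues` to the scikit-learn p-values quoted in the test comment. It is plain Scala on arrays rather than DataFrames, and the object and method names are hypothetical. It is meant as an illustration of the rules, not as the selector's implementation.

```scala
// Hypothetical standalone object; operates on plain arrays instead of a result DataFrame.
object PValueThresholdSketch {

  // fpr: keep every feature whose p-value is strictly below alpha.
  def fpr(pValues: Array[Double], alpha: Double): Array[Int] =
    pValues.indices.filter(i => pValues(i) < alpha).toArray

  // fwe: the same comparison against alpha / numFeatures (a Bonferroni-style correction).
  def fwe(pValues: Array[Double], alpha: Double): Array[Int] =
    fpr(pValues, alpha / pValues.length)

  // fdr: Benjamini-Hochberg. Sort by (pValue, index), find the largest 0-based rank r
  // with p <= (alpha / numFeatures) * (r + 1), and keep the first r + 1 features.
  def fdr(pValues: Array[Double], alpha: Double): Array[Int] = {
    val sorted = pValues.zipWithIndex.sortWith { (a, b) =>
      a._1 < b._1 || (a._1 == b._1 && a._2 < b._2)
    }
    val f = alpha / pValues.length
    val maxRank = sorted.indices.filter(r => sorted(r)._1 <= f * (r + 1)).lastOption
    maxRank match {
      case Some(r) => sorted.take(r + 1).map(_._2)
      case None => Array.emptyIntArray
    }
  }

  def main(args: Array[String]): Unit = {
    // p-values from the f_classif(X, y) output quoted in the suite's comment.
    val p = Array(2.43864448e-13, 5.09088367e-10, 1.49033067e-11,
      5.00596446e-01, 4.96684374e-01, 3.83798191e-01)
    println(fpr(p, 1.0e-12).mkString(","))
    println(fdr(p, 6.0e-12).mkString(","))
    println(fwe(p, 6.0e-12).mkString(","))
  }
}
```

With these inputs all three rules keep only feature 0, which agrees with the `topFeature` column the suite compares against.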
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/FValueSelectorSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/FValueSelectorSuite.scala
deleted file mode 100644
index 5c12001..0000000
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/FValueSelectorSuite.scala
+++ /dev/null
@@ -1,238 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.ml.feature
-
-import org.apache.spark.ml.linalg.{Vector, Vectors}
-import org.apache.spark.ml.param.ParamsSuite
-import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest, MLTestingUtils}
-import org.apache.spark.ml.util.TestingUtils._
-import org.apache.spark.sql.{Dataset, Row}
-
-class FValueSelectorSuite extends MLTest with DefaultReadWriteTest {
-
- import testImplicits._
-
- @transient var dataset: Dataset[_] = _
-
- override def beforeAll(): Unit = {
- super.beforeAll()
-
- // scalastyle:off
- /*
- Use the following sklearn data in this test
-
- >>> from sklearn.feature_selection import f_regression
- >>> import numpy as np
- >>> np.random.seed(777)
- >>> X = np.random.rand(20, 6)
- >>> w = np.array([0.3, 0.4, 0.5, 0, 0, 0])
- >>> y = X @ w
- >>> X
- array([[0.19151945, 0.62210877, 0.43772774, 0.78535858, 0.77997581,
- 0.27259261],
- [0.27646426, 0.80187218, 0.95813935, 0.87593263, 0.35781727,
- 0.50099513],
- [0.68346294, 0.71270203, 0.37025075, 0.56119619, 0.50308317,
- 0.01376845],
- [0.77282662, 0.88264119, 0.36488598, 0.61539618, 0.07538124,
- 0.36882401],
- [0.9331401 , 0.65137814, 0.39720258, 0.78873014, 0.31683612,
- 0.56809865],
- [0.86912739, 0.43617342, 0.80214764, 0.14376682, 0.70426097,
- 0.70458131],
- [0.21879211, 0.92486763, 0.44214076, 0.90931596, 0.05980922,
- 0.18428708],
- [0.04735528, 0.67488094, 0.59462478, 0.53331016, 0.04332406,
- 0.56143308],
- [0.32966845, 0.50296683, 0.11189432, 0.60719371, 0.56594464,
- 0.00676406],
- [0.61744171, 0.91212289, 0.79052413, 0.99208147, 0.95880176,
- 0.79196414],
- [0.28525096, 0.62491671, 0.4780938 , 0.19567518, 0.38231745,
- 0.05387369],
- [0.45164841, 0.98200474, 0.1239427 , 0.1193809 , 0.73852306,
- 0.58730363],
- [0.47163253, 0.10712682, 0.22921857, 0.89996519, 0.41675354,
- 0.53585166],
- [0.00620852, 0.30064171, 0.43689317, 0.612149 , 0.91819808,
- 0.62573667],
- [0.70599757, 0.14983372, 0.74606341, 0.83100699, 0.63372577,
- 0.43830988],
- [0.15257277, 0.56840962, 0.52822428, 0.95142876, 0.48035918,
- 0.50255956],
- [0.53687819, 0.81920207, 0.05711564, 0.66942174, 0.76711663,
- 0.70811536],
- [0.79686718, 0.55776083, 0.96583653, 0.1471569 , 0.029647 ,
- 0.59389349],
- [0.1140657 , 0.95080985, 0.32570741, 0.19361869, 0.45781165,
- 0.92040257],
- [0.87906916, 0.25261576, 0.34800879, 0.18258873, 0.90179605,
- 0.70652816]])
- >>> y
- array([0.52516321, 0.88275782, 0.67524507, 0.76734745, 0.73909458,
- 0.83628141, 0.65665506, 0.58147135, 0.35603443, 0.94534373,
- 0.57458887, 0.59026777, 0.29894977, 0.34056582, 0.64476446,
- 0.53724782, 0.5173021 , 0.94508275, 0.57739736, 0.53877145])
- >>> f_regression(X, y)
- (array([5.58025504, 3.98311705, 20.59605518, 0.07993376, 1.25127646,
- 0.7676937 ]),
- array([2.96302196e-02, 6.13173918e-02, 2.54580618e-04, 7.80612726e-01,
- 2.78015517e-01, 3.92474567e-01]))
- */
- // scalastyle:on
-
- val data = Seq(
- (0.52516321, Vectors.dense(0.19151945, 0.62210877, 0.43772774, 0.78535858, 0.77997581,
- 0.27259261), Vectors.dense(0.43772774)),
- (0.88275782, Vectors.dense(0.27646426, 0.80187218, 0.95813935, 0.87593263, 0.35781727,
- 0.50099513), Vectors.dense(0.95813935)),
- (0.67524507, Vectors.dense(0.68346294, 0.71270203, 0.37025075, 0.56119619, 0.50308317,
- 0.01376845), Vectors.dense(0.37025075)),
- (0.76734745, Vectors.dense(0.77282662, 0.88264119, 0.36488598, 0.61539618, 0.07538124,
- 0.36882401), Vectors.dense(0.36488598)),
- (0.73909458, Vectors.dense(0.9331401, 0.65137814, 0.39720258, 0.78873014, 0.31683612,
- 0.56809865), Vectors.dense(0.39720258)),
-
- (0.83628141, Vectors.dense(0.86912739, 0.43617342, 0.80214764, 0.14376682, 0.70426097,
- 0.70458131), Vectors.dense(0.80214764)),
- (0.65665506, Vectors.dense(0.21879211, 0.92486763, 0.44214076, 0.90931596, 0.05980922,
- 0.18428708), Vectors.dense(0.44214076)),
- (0.58147135, Vectors.dense(0.04735528, 0.67488094, 0.59462478, 0.53331016, 0.04332406,
- 0.56143308), Vectors.dense(0.59462478)),
- (0.35603443, Vectors.dense(0.32966845, 0.50296683, 0.11189432, 0.60719371, 0.56594464,
- 0.00676406), Vectors.dense(0.11189432)),
- (0.94534373, Vectors.dense(0.61744171, 0.91212289, 0.79052413, 0.99208147, 0.95880176,
- 0.79196414), Vectors.dense(0.79052413)),
-
- (0.57458887, Vectors.dense(0.28525096, 0.62491671, 0.4780938, 0.19567518, 0.38231745,
- 0.05387369), Vectors.dense(0.4780938)),
- (0.59026777, Vectors.dense(0.45164841, 0.98200474, 0.1239427, 0.1193809, 0.73852306,
- 0.58730363), Vectors.dense(0.1239427)),
- (0.29894977, Vectors.dense(0.47163253, 0.10712682, 0.22921857, 0.89996519, 0.41675354,
- 0.53585166), Vectors.dense(0.22921857)),
- (0.34056582, Vectors.dense(0.00620852, 0.30064171, 0.43689317, 0.612149, 0.91819808,
- 0.62573667), Vectors.dense(0.43689317)),
- (0.64476446, Vectors.dense(0.70599757, 0.14983372, 0.74606341, 0.83100699, 0.63372577,
- 0.43830988), Vectors.dense(0.74606341)),
-
- (0.53724782, Vectors.dense(0.15257277, 0.56840962, 0.52822428, 0.95142876, 0.48035918,
- 0.50255956), Vectors.dense(0.52822428)),
- (0.5173021, Vectors.dense(0.53687819, 0.81920207, 0.05711564, 0.66942174, 0.76711663,
- 0.70811536), Vectors.dense(0.05711564)),
- (0.94508275, Vectors.dense(0.79686718, 0.55776083, 0.96583653, 0.1471569, 0.029647,
- 0.59389349), Vectors.dense(0.96583653)),
- (0.57739736, Vectors.dense(0.1140657, 0.95080985, 0.96583653, 0.19361869, 0.45781165,
- 0.92040257), Vectors.dense(0.96583653)),
- (0.53877145, Vectors.dense(0.87906916, 0.25261576, 0.34800879, 0.18258873, 0.90179605,
- 0.70652816), Vectors.dense(0.34800879)))
-
- dataset = spark.createDataFrame(data).toDF("label", "features", "topFeature")
- }
-
- test("params") {
- ParamsSuite.checkParams(new FValueSelector)
- }
-
- test("Test FValue selector: numTopFeatures") {
- val selector = new FValueSelector()
- .setOutputCol("filtered").setSelectorType("numTopFeatures").setNumTopFeatures(1)
- val model = testSelector(selector, dataset)
- MLTestingUtils.checkCopyAndUids(selector, model)
- }
-
- test("Test F Value selector: percentile") {
- val selector = new FValueSelector()
- .setOutputCol("filtered").setSelectorType("percentile").setPercentile(0.17)
- val model = testSelector(selector, dataset)
- MLTestingUtils.checkCopyAndUids(selector, model)
- }
-
- test("Test F Value selector: fpr") {
- val selector = new FValueSelector()
- .setOutputCol("filtered").setSelectorType("fpr").setFpr(0.01)
- val model = testSelector(selector, dataset)
- MLTestingUtils.checkCopyAndUids(selector, model)
- }
-
- test("Test F Value selector: fdr") {
- val selector = new FValueSelector()
- .setOutputCol("filtered").setSelectorType("fdr").setFdr(0.03)
- val model = testSelector(selector, dataset)
- MLTestingUtils.checkCopyAndUids(selector, model)
- }
-
- test("Test F Value selector: fwe") {
- val selector = new FValueSelector()
- .setOutputCol("filtered").setSelectorType("fwe").setFwe(0.03)
- val model = testSelector(selector, dataset)
- MLTestingUtils.checkCopyAndUids(selector, model)
- }
-
- test("Test FValue selector with sparse vector") {
- val df = spark.createDataFrame(Seq(
- (4.6, Vectors.sparse(6, Array((0, 6.0), (1, 7.0), (3, 7.0), (4, 6.0))), Vectors.dense(0.0)),
- (6.6, Vectors.sparse(6, Array((1, 9.0), (2, 6.0), (4, 5.0), (5, 9.0))), Vectors.dense(6.0)),
- (5.1, Vectors.sparse(6, Array((1, 9.0), (2, 3.0), (4, 5.0), (5, 5.0))), Vectors.dense(3.0)),
- (7.6, Vectors.dense(Array(0.0, 9.0, 8.0, 5.0, 6.0, 4.0)), Vectors.dense(8.0)),
- (9.0, Vectors.dense(Array(8.0, 9.0, 6.0, 5.0, 4.0, 4.0)), Vectors.dense(6.0)),
- (9.0, Vectors.dense(Array(8.0, 9.0, 6.0, 4.0, 0.0, 0.0)), Vectors.dense(6.0))
- )).toDF("label", "features", "topFeature")
-
- val selector = new FValueSelector()
- .setOutputCol("filtered").setSelectorType("numTopFeatures").setNumTopFeatures(1)
- val model = testSelector(selector, df)
- MLTestingUtils.checkCopyAndUids(selector, model)
- }
-
- test("read/write") {
- def checkModelData(model: FValueSelectorModel, model2:
- FValueSelectorModel): Unit = {
- assert(model.selectedFeatures === model2.selectedFeatures)
- }
- val fSelector = new FValueSelector
- testEstimatorAndModelReadWrite(fSelector, dataset,
- FValueSelectorSuite.allParamSettings,
- FValueSelectorSuite.allParamSettings, checkModelData)
- }
-
- private def testSelector(selector: FValueSelector, data: Dataset[_]):
- FValueSelectorModel = {
- val selectorModel = selector.fit(data)
- testTransformer[(Double, Vector, Vector)](data.toDF(), selectorModel,
- "filtered", "topFeature") {
- case Row(vec1: Vector, vec2: Vector) =>
- assert(vec1 ~== vec2 absTol 1e-6)
- }
- selectorModel
- }
-}
-
-object FValueSelectorSuite {
-
- /**
- * Mapping from all Params to valid settings which differ from the defaults.
- * This is useful for tests which need to exercise all Params, such as save/load.
- * This excludes input columns to simplify some tests.
- */
- val allParamSettings: Map[String, Any] = Map(
- "selectorType" -> "percentile",
- "numTopFeatures" -> 1,
- "percentile" -> 0.12,
- "outputCol" -> "myOutput"
- )
-}
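The chi-squared fixture in the replacement suite, UnivariateFeatureSelectorSuite.scala, documents a contingency table, statistic, and p-value for each feature. As a rough cross-check that needs neither R nor Spark, the sketch below recomputes Pearson's statistic in plain Python. The helper names `chi2_independence` and `chi2_sf_even_dof` are hypothetical. The closed-form tail probability is valid only for an even number of degrees of freedom, which happens to cover all six tables in the fixture.

```python
import math
from collections import Counter


def chi2_independence(xs, ys):
    """Pearson chi-squared test of independence; returns (statistic, dof)."""
    n = len(xs)
    row, col, cell = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            # Counter returns 0 for (r, c) pairs that never occur together.
            stat += (cell[(r, c)] - expected) ** 2 / expected
    return stat, (len(row) - 1) * (len(col) - 1)


def chi2_sf_even_dof(x, dof):
    """Upper-tail probability of the chi-squared distribution (even dof only)."""
    assert dof % 2 == 0, "closed form below requires an even dof"
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(dof // 2))


# Label column and feature1 column, copied from the R snippet in the fixture.
y = [0.0, 1.0, 1.0, 1.0, 2.0, 2.0]
x1 = [6.0, 0.0, 0.0, 0.0, 8.0, 8.0]

stat, dof = chi2_independence(x1, y)
p_value = chi2_sf_even_dof(stat, dof)
print(round(stat, 3), dof, round(p_value, 3))  # 12.0 4 0.017
```

The same two calls applied to the `x2` column give a statistic of 6 with 2 degrees of freedom, in line with the second table in the fixture.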
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/UnivariateFeatureSelectorSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/UnivariateFeatureSelectorSuite.scala
new file mode 100644
index 0000000..84868dc
--- /dev/null
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/UnivariateFeatureSelectorSuite.scala
@@ -0,0 +1,685 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.ml.stat.{ANOVATest, FValueTest}
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest, MLTestingUtils}
+import org.apache.spark.ml.util.TestingUtils._
+import org.apache.spark.sql.{Dataset, Row}
+
+class UnivariateFeatureSelectorSuite extends MLTest with DefaultReadWriteTest {
+
+ import testImplicits._
+
+ @transient var datasetChi2: Dataset[_] = _
+ @transient var datasetAnova: Dataset[_] = _
+ @transient var datasetFRegression: Dataset[_] = _
+
+ private var selector1: UnivariateFeatureSelector = _
+ private var selector2: UnivariateFeatureSelector = _
+ private var selector3: UnivariateFeatureSelector = _
+
+ override def beforeAll(): Unit = {
+ super.beforeAll()
+ // Toy dataset, including the top feature for a chi-squared test.
+ // These data are chosen such that each feature's test has a distinct p-value.
+ /*
+ * Contingency tables
+ * feature1 = {6.0, 0.0, 8.0}
+ * class 0 1 2
+ * 6.0||1|0|0|
+ * 0.0||0|3|0|
+ * 8.0||0|0|2|
+ * degrees of freedom = 4, statistic = 12, pValue = 0.017
+ *
+ * feature2 = {7.0, 9.0}
+ * class 0 1 2
+ * 7.0||1|0|0|
+ * 9.0||0|3|2|
+ * degrees of freedom = 2, statistic = 6, pValue = 0.049
+ *
+ * feature3 = {0.0, 6.0, 3.0, 8.0}
+ * class 0 1 2
+ * 0.0||1|0|0|
+ * 6.0||0|1|2|
+ * 3.0||0|1|0|
+ * 8.0||0|1|0|
+ * degrees of freedom = 6, statistic = 8.66, pValue = 0.193
+ *
+ * feature4 = {7.0, 0.0, 5.0, 4.0}
+ * class 0 1 2
+ * 7.0||1|0|0|
+ * 0.0||0|2|0|
+ * 5.0||0|1|1|
+ * 4.0||0|0|1|
+ * degrees of freedom = 6, statistic = 9.5, pValue = 0.147
+ *
+ * feature5 = {6.0, 5.0, 4.0, 0.0}
+ * class 0 1 2
+ * 6.0||1|1|0|
+ * 5.0||0|2|0|
+ * 4.0||0|0|1|
+ * 0.0||0|0|1|
+ * degrees of freedom = 6, statistic = 8.0, pValue = 0.238
+ *
+ * feature6 = {0.0, 9.0, 5.0, 4.0}
+ * class 0 1 2
+ * 0.0||1|0|1|
+ * 9.0||0|1|0|
+ * 5.0||0|1|0|
+ * 4.0||0|1|1|
+ * degrees of freedom = 6, statistic = 5, pValue = 0.54
+ *
+ * To verify the results with R, run:
+ * library(stats)
+ * x1 <- c(6.0, 0.0, 0.0, 0.0, 8.0, 8.0)
+ * x2 <- c(7.0, 9.0, 9.0, 9.0, 9.0, 9.0)
+ * x3 <- c(0.0, 6.0, 3.0, 8.0, 6.0, 6.0)
+ * x4 <- c(7.0, 0.0, 0.0, 5.0, 5.0, 4.0)
+ * x5 <- c(6.0, 5.0, 5.0, 6.0, 4.0, 0.0)
+ * x6 <- c(0.0, 9.0, 5.0, 4.0, 4.0, 0.0)
+ * y <- c(0.0, 1.0, 1.0, 1.0, 2.0, 2.0)
+ * chisq.test(x1,y)
+ * chisq.test(x2,y)
+ * chisq.test(x3,y)
+ * chisq.test(x4,y)
+ * chisq.test(x5,y)
+ * chisq.test(x6,y)
+ */
+
+ datasetChi2 = spark.createDataFrame(Seq(
+ (0.0, Vectors.sparse(6, Array((0, 6.0), (1, 7.0), (3, 7.0), (4, 6.0))), Vectors.dense(6.0)),
+ (1.0, Vectors.sparse(6, Array((1, 9.0), (2, 6.0), (4, 5.0), (5, 9.0))), Vectors.dense(0.0)),
+ (1.0, Vectors.sparse(6, Array((1, 9.0), (2, 3.0), (4, 5.0), (5, 5.0))), Vectors.dense(0.0)),
+ (1.0, Vectors.dense(Array(0.0, 9.0, 8.0, 5.0, 6.0, 4.0)), Vectors.dense(0.0)),
+ (2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 5.0, 4.0, 4.0)), Vectors.dense(8.0)),
+ (2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 4.0, 0.0, 0.0)), Vectors.dense(8.0))
+ )).toDF("label", "features", "topFeature")
+
+ // scalastyle:off
+ /*
+ X:
+ array([[4.65415496e-03, 1.03550567e-01, -1.17358140e+00,
+ 1.61408773e-01, 3.92492111e-01, 7.31240882e-01],
+ [-9.01651741e-01, -5.28905302e-01, 1.27636785e+00,
+ 7.02154563e-01, 6.21348351e-01, 1.88397353e-01],
+ [ 3.85692159e-01, -9.04639637e-01, 5.09782604e-02,
+ 8.40043971e-01, 7.45977857e-01, 8.78402288e-01],
+ [ 1.36264353e+00, 2.62454094e-01, 7.96306202e-01,
+ 6.14948000e-01, 7.44948187e-01, 9.74034830e-01],
+ [ 9.65874070e-01, 2.52773665e+00, -2.19380094e+00,
+ 2.33408080e-01, 1.86340919e-01, 8.23390433e-01],
+ [ 1.12324305e+01, -2.77121515e-01, 1.12740513e-01,
+ 2.35184013e-01, 3.46668895e-01, 9.38500782e-02],
+ [ 1.06195839e+01, -1.82891238e+00, 2.25085601e-01,
+ 9.09979851e-01, 6.80257535e-02, 8.24017480e-01],
+ [ 1.12806837e+01, 1.30686889e+00, 9.32839108e-02,
+ 3.49784755e-01, 1.71322408e-02, 7.48465194e-02],
+ [ 9.98689462e+00, 9.50808938e-01, -2.90786359e-01,
+ 2.31253009e-01, 7.46270968e-01, 1.60308169e-01],
+ [ 1.08428551e+01, -1.02749936e+00, 1.73951508e-01,
+ 8.92482744e-02, 1.42651730e-01, 7.66751625e-01],
+ [-1.98641448e+00, 1.12811990e+01, -2.35246756e-01,
+ 8.22809049e-01, 3.26739456e-01, 7.88268404e-01],
+ [-6.09864090e-01, 1.07346276e+01, -2.18805509e-01,
+ 7.33931213e-01, 1.42554396e-01, 7.11225605e-01],
+ [-1.58481268e+00, 9.19364039e+00, -5.87490459e-02,
+ 2.51532056e-01, 2.82729807e-01, 7.16245686e-01],
+ [-2.50949277e-01, 1.12815254e+01, -6.94806734e-01,
+ 5.93898886e-01, 5.68425656e-01, 8.49762330e-01],
+ [ 7.63485129e-01, 1.02605138e+01, 1.32617719e+00,
+ 5.49682879e-01, 8.59931442e-01, 4.88677978e-02],
+ [ 9.34900015e-01, 4.11379043e-01, 8.65010205e+00,
+ 9.23509168e-01, 1.16995043e-01, 5.91894106e-03],
+ [ 4.73734933e-01, -1.48321181e+00, 9.73349621e+00,
+ 4.09421563e-01, 5.09375719e-01, 5.93157850e-01],
+ [ 3.41470679e-01, -6.88972582e-01, 9.60347938e+00,
+ 3.62654055e-01, 2.43437468e-01, 7.13052838e-01],
+ [-5.29614251e-01, -1.39262856e+00, 1.01354144e+01,
+ 8.24123861e-01, 5.84074506e-01, 6.54461558e-01],
+ [-2.99454508e-01, 2.20457263e+00, 1.14586015e+01,
+ 5.16336729e-01, 9.99776159e-01, 3.15769738e-01]])
+ y:
+ array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4])
+ scikit-learn result:
+ >>> f_classif(X, y)
+ (array([228.27701422, 84.33070501, 134.25330675, 0.82211775, 0.82991363, 1.08478943]),
+ array([2.43864448e-13, 5.09088367e-10, 1.49033067e-11, 5.00596446e-01, 4.96684374e-01, 3.83798191e-01]))
+ */
+ // scalastyle:on
+
+ val dataAnova = Seq(
+ (1, Vectors.dense(4.65415496e-03, 1.03550567e-01, -1.17358140e+00,
+ 1.61408773e-01, 3.92492111e-01, 7.31240882e-01), Vectors.dense(4.65415496e-03)),
+ (1, Vectors.dense(-9.01651741e-01, -5.28905302e-01, 1.27636785e+00,
+ 7.02154563e-01, 6.21348351e-01, 1.88397353e-01), Vectors.dense(-9.01651741e-01)),
+ (1, Vectors.dense(3.85692159e-01, -9.04639637e-01, 5.09782604e-02,
+ 8.40043971e-01, 7.45977857e-01, 8.78402288e-01), Vectors.dense(3.85692159e-01)),
+ (1, Vectors.dense(1.36264353e+00, 2.62454094e-01, 7.96306202e-01,
+ 6.14948000e-01, 7.44948187e-01, 9.74034830e-01), Vectors.dense(1.36264353e+00)),
+ (1, Vectors.dense(9.65874070e-01, 2.52773665e+00, -2.19380094e+00,
+ 2.33408080e-01, 1.86340919e-01, 8.23390433e-01), Vectors.dense(9.65874070e-01)),
+ (2, Vectors.dense(1.12324305e+01, -2.77121515e-01, 1.12740513e-01,
+ 2.35184013e-01, 3.46668895e-01, 9.38500782e-02), Vectors.dense(1.12324305e+01)),
+ (2, Vectors.dense(1.06195839e+01, -1.82891238e+00, 2.25085601e-01,
+ 9.09979851e-01, 6.80257535e-02, 8.24017480e-01), Vectors.dense(1.06195839e+01)),
+ (2, Vectors.dense(1.12806837e+01, 1.30686889e+00, 9.32839108e-02,
+ 3.49784755e-01, 1.71322408e-02, 7.48465194e-02), Vectors.dense(1.12806837e+01)),
+ (2, Vectors.dense(9.98689462e+00, 9.50808938e-01, -2.90786359e-01,
+ 2.31253009e-01, 7.46270968e-01, 1.60308169e-01), Vectors.dense(9.98689462e+00)),
+ (2, Vectors.dense(1.08428551e+01, -1.02749936e+00, 1.73951508e-01,
+ 8.92482744e-02, 1.42651730e-01, 7.66751625e-01), Vectors.dense(1.08428551e+01)),
+ (3, Vectors.dense(-1.98641448e+00, 1.12811990e+01, -2.35246756e-01,
+ 8.22809049e-01, 3.26739456e-01, 7.88268404e-01), Vectors.dense(-1.98641448e+00)),
+ (3, Vectors.dense(-6.09864090e-01, 1.07346276e+01, -2.18805509e-01,
+ 7.33931213e-01, 1.42554396e-01, 7.11225605e-01), Vectors.dense(-6.09864090e-01)),
+ (3, Vectors.dense(-1.58481268e+00, 9.19364039e+00, -5.87490459e-02,
+ 2.51532056e-01, 2.82729807e-01, 7.16245686e-01), Vectors.dense(-1.58481268e+00)),
+ (3, Vectors.dense(-2.50949277e-01, 1.12815254e+01, -6.94806734e-01,
+ 5.93898886e-01, 5.68425656e-01, 8.49762330e-01), Vectors.dense(-2.50949277e-01)),
+ (3, Vectors.dense(7.63485129e-01, 1.02605138e+01, 1.32617719e+00,
+ 5.49682879e-01, 8.59931442e-01, 4.88677978e-02), Vectors.dense(7.63485129e-01)),
+ (4, Vectors.dense(9.34900015e-01, 4.11379043e-01, 8.65010205e+00,
+ 9.23509168e-01, 1.16995043e-01, 5.91894106e-03), Vectors.dense(9.34900015e-01)),
+ (4, Vectors.dense(4.73734933e-01, -1.48321181e+00, 9.73349621e+00,
+ 4.09421563e-01, 5.09375719e-01, 5.93157850e-01), Vectors.dense(4.73734933e-01)),
+ (4, Vectors.dense(3.41470679e-01, -6.88972582e-01, 9.60347938e+00,
+ 3.62654055e-01, 2.43437468e-01, 7.13052838e-01), Vectors.dense(3.41470679e-01)),
+ (4, Vectors.dense(-5.29614251e-01, -1.39262856e+00, 1.01354144e+01,
+ 8.24123861e-01, 5.84074506e-01, 6.54461558e-01), Vectors.dense(-5.29614251e-01)),
+ (4, Vectors.dense(-2.99454508e-01, 2.20457263e+00, 1.14586015e+01,
+ 5.16336729e-01, 9.99776159e-01, 3.15769738e-01), Vectors.dense(-2.99454508e-01)))
+
+ datasetAnova = spark.createDataFrame(dataAnova).toDF("label", "features", "topFeature")
+
+ // scalastyle:off
+ /*
+ Use the following sklearn data in this test
+
+ >>> from sklearn.feature_selection import f_regression
+ >>> import numpy as np
+ >>> np.random.seed(777)
+ >>> X = np.random.rand(20, 6)
+ >>> w = np.array([0.3, 0.4, 0.5, 0, 0, 0])
+ >>> y = X @ w
+ >>> X
+ array([[0.19151945, 0.62210877, 0.43772774, 0.78535858, 0.77997581,
+ 0.27259261],
+ [0.27646426, 0.80187218, 0.95813935, 0.87593263, 0.35781727,
+ 0.50099513],
+ [0.68346294, 0.71270203, 0.37025075, 0.56119619, 0.50308317,
+ 0.01376845],
+ [0.77282662, 0.88264119, 0.36488598, 0.61539618, 0.07538124,
+ 0.36882401],
+ [0.9331401 , 0.65137814, 0.39720258, 0.78873014, 0.31683612,
+ 0.56809865],
+ [0.86912739, 0.43617342, 0.80214764, 0.14376682, 0.70426097,
+ 0.70458131],
+ [0.21879211, 0.92486763, 0.44214076, 0.90931596, 0.05980922,
+ 0.18428708],
+ [0.04735528, 0.67488094, 0.59462478, 0.53331016, 0.04332406,
+ 0.56143308],
+ [0.32966845, 0.50296683, 0.11189432, 0.60719371, 0.56594464,
+ 0.00676406],
+ [0.61744171, 0.91212289, 0.79052413, 0.99208147, 0.95880176,
+ 0.79196414],
+ [0.28525096, 0.62491671, 0.4780938 , 0.19567518, 0.38231745,
+ 0.05387369],
+ [0.45164841, 0.98200474, 0.1239427 , 0.1193809 , 0.73852306,
+ 0.58730363],
+ [0.47163253, 0.10712682, 0.22921857, 0.89996519, 0.41675354,
+ 0.53585166],
+ [0.00620852, 0.30064171, 0.43689317, 0.612149 , 0.91819808,
+ 0.62573667],
+ [0.70599757, 0.14983372, 0.74606341, 0.83100699, 0.63372577,
+ 0.43830988],
+ [0.15257277, 0.56840962, 0.52822428, 0.95142876, 0.48035918,
+ 0.50255956],
+ [0.53687819, 0.81920207, 0.05711564, 0.66942174, 0.76711663,
+ 0.70811536],
+ [0.79686718, 0.55776083, 0.96583653, 0.1471569 , 0.029647 ,
+ 0.59389349],
+ [0.1140657 , 0.95080985, 0.32570741, 0.19361869, 0.45781165,
+ 0.92040257],
+ [0.87906916, 0.25261576, 0.34800879, 0.18258873, 0.90179605,
+ 0.70652816]])
+ >>> y
+ array([0.52516321, 0.88275782, 0.67524507, 0.76734745, 0.73909458,
+ 0.83628141, 0.65665506, 0.58147135, 0.35603443, 0.94534373,
+ 0.57458887, 0.59026777, 0.29894977, 0.34056582, 0.64476446,
+ 0.53724782, 0.5173021 , 0.94508275, 0.57739736, 0.53877145])
+ >>> f_regression(X, y)
+ (array([5.58025504, 3.98311705, 20.59605518, 0.07993376, 1.25127646,
+ 0.7676937 ]),
+ array([2.96302196e-02, 6.13173918e-02, 2.54580618e-04, 7.80612726e-01,
+ 2.78015517e-01, 3.92474567e-01]))
+ */
+ // scalastyle:on
+
+ val dataFRegression = Seq(
+ (0.52516321, Vectors.dense(0.19151945, 0.62210877, 0.43772774, 0.78535858, 0.77997581,
+ 0.27259261), Vectors.dense(0.43772774)),
+ (0.88275782, Vectors.dense(0.27646426, 0.80187218, 0.95813935, 0.87593263, 0.35781727,
+ 0.50099513), Vectors.dense(0.95813935)),
+ (0.67524507, Vectors.dense(0.68346294, 0.71270203, 0.37025075, 0.56119619, 0.50308317,
+ 0.01376845), Vectors.dense(0.37025075)),
+ (0.76734745, Vectors.dense(0.77282662, 0.88264119, 0.36488598, 0.61539618, 0.07538124,
+ 0.36882401), Vectors.dense(0.36488598)),
+ (0.73909458, Vectors.dense(0.9331401, 0.65137814, 0.39720258, 0.78873014, 0.31683612,
+ 0.56809865), Vectors.dense(0.39720258)),
+
+ (0.83628141, Vectors.dense(0.86912739, 0.43617342, 0.80214764, 0.14376682, 0.70426097,
+ 0.70458131), Vectors.dense(0.80214764)),
+ (0.65665506, Vectors.dense(0.21879211, 0.92486763, 0.44214076, 0.90931596, 0.05980922,
+ 0.18428708), Vectors.dense(0.44214076)),
+ (0.58147135, Vectors.dense(0.04735528, 0.67488094, 0.59462478, 0.53331016, 0.04332406,
+ 0.56143308), Vectors.dense(0.59462478)),
+ (0.35603443, Vectors.dense(0.32966845, 0.50296683, 0.11189432, 0.60719371, 0.56594464,
+ 0.00676406), Vectors.dense(0.11189432)),
+ (0.94534373, Vectors.dense(0.61744171, 0.91212289, 0.79052413, 0.99208147, 0.95880176,
+ 0.79196414), Vectors.dense(0.79052413)),
+
+ (0.57458887, Vectors.dense(0.28525096, 0.62491671, 0.4780938, 0.19567518, 0.38231745,
+ 0.05387369), Vectors.dense(0.4780938)),
+ (0.59026777, Vectors.dense(0.45164841, 0.98200474, 0.1239427, 0.1193809, 0.73852306,
+ 0.58730363), Vectors.dense(0.1239427)),
+ (0.29894977, Vectors.dense(0.47163253, 0.10712682, 0.22921857, 0.89996519, 0.41675354,
+ 0.53585166), Vectors.dense(0.22921857)),
+ (0.34056582, Vectors.dense(0.00620852, 0.30064171, 0.43689317, 0.612149, 0.91819808,
+ 0.62573667), Vectors.dense(0.43689317)),
+ (0.64476446, Vectors.dense(0.70599757, 0.14983372, 0.74606341, 0.83100699, 0.63372577,
+ 0.43830988), Vectors.dense(0.74606341)),
+
+ (0.53724782, Vectors.dense(0.15257277, 0.56840962, 0.52822428, 0.95142876, 0.48035918,
+ 0.50255956), Vectors.dense(0.52822428)),
+ (0.5173021, Vectors.dense(0.53687819, 0.81920207, 0.05711564, 0.66942174, 0.76711663,
+ 0.70811536), Vectors.dense(0.05711564)),
+ (0.94508275, Vectors.dense(0.79686718, 0.55776083, 0.96583653, 0.1471569, 0.029647,
+ 0.59389349), Vectors.dense(0.96583653)),
+ (0.57739736, Vectors.dense(0.1140657, 0.95080985, 0.32570741, 0.19361869, 0.45781165,
+ 0.92040257), Vectors.dense(0.32570741)),
+ (0.53877145, Vectors.dense(0.87906916, 0.25261576, 0.34800879, 0.18258873, 0.90179605,
+ 0.70652816), Vectors.dense(0.34800879)))
+
+ datasetFRegression = spark.createDataFrame(dataFRegression)
+ .toDF("label", "features", "topFeature")
+
+ selector1 = new UnivariateFeatureSelector()
+ .setOutputCol("filtered")
+ .setFeatureType("continuous")
+ .setLabelType("categorical")
+ selector2 = new UnivariateFeatureSelector()
+ .setOutputCol("filtered")
+ .setFeatureType("continuous")
+ .setLabelType("continuous")
+ selector3 = new UnivariateFeatureSelector()
+ .setOutputCol("filtered")
+ .setFeatureType("categorical")
+ .setLabelType("categorical")
+ }
+
+ test("params") {
+ ParamsSuite.checkParams(new UnivariateFeatureSelector())
+ }
+
+ test("Test numTopFeatures") {
+ val testParams: Seq[(UnivariateFeatureSelector, Dataset[_])] = Seq(
+ (selector1.setSelectionMode("numTopFeatures").setSelectionThreshold(1), datasetAnova),
+ (selector2.setSelectionMode("numTopFeatures").setSelectionThreshold(1), datasetFRegression),
+ (selector3.setSelectionMode("numTopFeatures").setSelectionThreshold(1), datasetChi2)
+ )
+ for ((sel, dataset) <- testParams) {
+ val model = testSelector(sel, dataset)
+ MLTestingUtils.checkCopyAndUids(sel, model)
+ }
+ }
+
+ test("Test percentile") {
+ val testParams: Seq[(UnivariateFeatureSelector, Dataset[_])] = Seq(
+ (selector1.setSelectionMode("percentile").setSelectionThreshold(0.17), datasetAnova),
+ (selector2.setSelectionMode("percentile").setSelectionThreshold(0.17), datasetFRegression),
+ (selector3.setSelectionMode("percentile").setSelectionThreshold(0.17), datasetChi2)
+ )
+ for ((sel, dataset) <- testParams) {
+ val model = testSelector(sel, dataset)
+ MLTestingUtils.checkCopyAndUids(sel, model)
+ }
+ }
+
+ test("Test fpr") {
+ val testParams: Seq[(UnivariateFeatureSelector, Dataset[_])] = Seq(
+ (selector1.setSelectionMode("fpr").setSelectionThreshold(1.0E-12), datasetAnova),
+ (selector2.setSelectionMode("fpr").setSelectionThreshold(0.01), datasetFRegression),
+ (selector3.setSelectionMode("fpr").setSelectionThreshold(0.02), datasetChi2)
+ )
+ for ((sel, dataset) <- testParams) {
+ val model = testSelector(sel, dataset)
+ MLTestingUtils.checkCopyAndUids(sel, model)
+ }
+ }
+
+ test("Test fdr") {
+ val testParams: Seq[(UnivariateFeatureSelector, Dataset[_])] = Seq(
+ (selector1.setSelectionMode("fdr").setSelectionThreshold(6.0E-12), datasetAnova),
+ (selector2.setSelectionMode("fdr").setSelectionThreshold(0.03), datasetFRegression),
+ (selector3.setSelectionMode("fdr").setSelectionThreshold(0.12), datasetChi2)
+ )
+ for ((sel, dataset) <- testParams) {
+ val model = testSelector(sel, dataset)
+ MLTestingUtils.checkCopyAndUids(sel, model)
+ }
+ }
+
+ test("Test fwe") {
+ val testParams: Seq[(UnivariateFeatureSelector, Dataset[_])] = Seq(
+ (selector1.setSelectionMode("fwe").setSelectionThreshold(6.0E-12), datasetAnova),
+ (selector2.setSelectionMode("fwe").setSelectionThreshold(0.03), datasetFRegression),
+ (selector3.setSelectionMode("fwe").setSelectionThreshold(0.12), datasetChi2)
+ )
+ for ((sel, dataset) <- testParams) {
+ val model = testSelector(sel, dataset)
+ MLTestingUtils.checkCopyAndUids(sel, model)
+ }
+ }
+
+ // Use the following sklearn program to verify the test
+ // scalastyle:off
+ /*
+ import numpy as np
+ from sklearn.feature_selection import SelectFdr, f_classif
+
+ X = np.random.rand(10, 6)
+ w = np.array([5, 5, 0.0, 0, 0, 0]).reshape((-1, 1))
+ y = np.rint(0.1 * (X @ w)).flatten()
+ print(X)
+ print(y)
+
+ F, p = f_classif(X, y)
+ print('F', F)
+ print('p', p)
+ selected = SelectFdr(f_classif, alpha=0.25).fit(X, y).get_support(True)
+
+ print(selected)
+ */
+
+ /*
+ sklearn result
+ [[0.92166066 0.82295823 0.31276624 0.63069973 0.64679537 0.94138368]
+ [0.47027783 0.74907889 0.43660557 0.93212582 0.5654378 0.531748 ]
+ [0.67771108 0.23926502 0.66906295 0.73117095 0.67340005 0.52864934]
+ [0.84565144 0.28050298 0.94137135 0.42479664 0.21600724 0.98956871]
+ [0.58818255 0.32223507 0.13727654 0.80948059 0.94617741 0.48460179]
+ [0.59528639 0.75838511 0.98648654 0.65561948 0.83818237 0.30178127]
+ [0.00264811 0.46492597 0.71428557 0.94708987 0.54587827 0.9484639 ]
+ [0.94604186 0.43187098 0.42135172 0.77256283 0.44334613 0.1514674 ]
+ [0.45694004 0.00273459 0.14580367 0.74278963 0.57819284 0.99413419]
+ [0.02256925 0.56136702 0.0629738 0.64130602 0.01536191 0.56638321]]
+ [1. 1. 0. 1. 0. 1. 0. 1. 0. 0.]
+ F [5.66456136e+00 4.08120006e+00 1.85418412e+00 8.67095392e-01
+ 2.87769237e-03 3.66010633e-01]
+ p [0.04454332 0.07803464 0.21040406 0.37900428 0.95853411 0.56195058]
+ [0 1]
+
+ [[0.27976711 0.48397753 0.18451698 0.59844137 0.01459805 0.98895542]
+ [0.97192726 0.46737333 0.08048093 0.38253056 0.04776121 0.55949538]
+ [0.62559834 0.44102192 0.19199043 0.959706 0.5332824 0.78621594]
+ [0.91649448 0.76501992 0.58678528 0.75239909 0.33179368 0.00893317]
+ [0.14086806 0.21876364 0.31767297 0.53061653 0.02786653 0.20021944]
+ [0.15214833 0.03028593 0.12326784 0.55663152 0.8333684 0.76923807]
+ [0.88178287 0.8492688 0.29417221 0.98122401 0.44103191 0.32709781]
+ [0.06686689 0.05834763 0.41316273 0.92850555 0.77308549 0.2931857 ]
+ [0.94747449 0.78336777 0.76096282 0.52368192 0.64814324 0.60455684]
+ [0.83382261 0.31412713 0.62490246 0.43896432 0.35390503 0.02316754]]
+ [0. 1. 1. 1. 0. 0. 1. 0. 1. 1.]
+ F [9.22227201e+01 8.36710241e+00 1.22217112e+00 1.63526175e-02
+ 8.91954821e-03 6.44534477e-01]
+ p [1.14739663e-05 2.01189199e-02 3.01070031e-01 9.01402125e-01
+ 9.27079623e-01 4.45267639e-01]
+ [0 1]
+ */
+ // scalastyle:on
+ test("Test selectIndicesFromPValues f_classif") {
+ val data_f_classif1 = Seq(
+ (1, Vectors.dense(0.92166066, 0.82295823, 0.31276624, 0.63069973, 0.64679537, 0.94138368),
+ Vectors.dense(0.92166066, 0.82295823)),
+ (1, Vectors.dense(0.47027783, 0.74907889, 0.43660557, 0.93212582, 0.5654378, 0.531748),
+ Vectors.dense(0.47027783, 0.74907889)),
+ (0, Vectors.dense(0.67771108, 0.23926502, 0.66906295, 0.73117095, 0.67340005, 0.52864934),
+ Vectors.dense(0.67771108, 0.23926502)),
+ (1, Vectors.dense(0.84565144, 0.28050298, 0.94137135, 0.42479664, 0.21600724, 0.98956871),
+ Vectors.dense(0.84565144, 0.28050298)),
+ (0, Vectors.dense(0.58818255, 0.32223507, 0.13727654, 0.80948059, 0.94617741, 0.48460179),
+ Vectors.dense(0.58818255, 0.32223507)),
+ (1, Vectors.dense(0.59528639, 0.75838511, 0.98648654, 0.65561948, 0.83818237, 0.30178127),
+ Vectors.dense(0.59528639, 0.75838511)),
+ (0, Vectors.dense(0.00264811, 0.46492597, 0.71428557, 0.94708987, 0.54587827, 0.9484639),
+ Vectors.dense(0.00264811, 0.46492597)),
+ (1, Vectors.dense(0.94604186, 0.43187098, 0.42135172, 0.77256283, 0.44334613, 0.1514674),
+ Vectors.dense(0.94604186, 0.43187098)),
+ (0, Vectors.dense(0.45694004, 0.00273459, 0.14580367, 0.74278963, 0.57819284, 0.99413419),
+ Vectors.dense(0.45694004, 0.00273459)),
+ (0, Vectors.dense(0.02256925, 0.56136702, 0.0629738, 0.64130602, 0.01536191, 0.56638321),
+ Vectors.dense(0.02256925, 0.56136702)))
+
+ val data_f_classif2 = Seq(
+ (0, Vectors.dense(0.27976711, 0.48397753, 0.18451698, 0.59844137, 0.01459805, 0.98895542),
+ Vectors.dense(0.27976711, 0.48397753)),
+ (1, Vectors.dense(0.97192726, 0.46737333, 0.08048093, 0.38253056, 0.04776121, 0.55949538),
+ Vectors.dense(0.97192726, 0.46737333)),
+ (1, Vectors.dense(0.62559834, 0.44102192, 0.19199043, 0.959706, 0.5332824, 0.78621594),
+ Vectors.dense(0.62559834, 0.44102192)),
+ (1, Vectors.dense(0.91649448, 0.76501992, 0.58678528, 0.75239909, 0.33179368, 0.00893317),
+ Vectors.dense(0.91649448, 0.76501992)),
+ (0, Vectors.dense(0.14086806, 0.21876364, 0.31767297, 0.53061653, 0.02786653, 0.20021944),
+ Vectors.dense(0.14086806, 0.21876364)),
+ (0, Vectors.dense(0.15214833, 0.03028593, 0.12326784, 0.55663152, 0.8333684, 0.76923807),
+ Vectors.dense(0.15214833, 0.03028593)),
+ (1, Vectors.dense(0.88178287, 0.8492688, 0.29417221, 0.98122401, 0.44103191, 0.32709781),
+ Vectors.dense(0.88178287, 0.8492688)),
+ (0, Vectors.dense(0.06686689, 0.05834763, 0.41316273, 0.92850555, 0.77308549, 0.2931857),
+ Vectors.dense(0.06686689, 0.05834763)),
+ (1, Vectors.dense(0.94747449, 0.78336777, 0.76096282, 0.52368192, 0.64814324, 0.60455684),
+ Vectors.dense(0.94747449, 0.78336777)),
+ (1, Vectors.dense(0.83382261, 0.31412713, 0.62490246, 0.43896432, 0.35390503, 0.02316754),
+ Vectors.dense(0.83382261, 0.31412713)))
+
+ val dataset_f_classification1 =
+ spark.createDataFrame(data_f_classif1).toDF("label", "features", "topFeature")
+
+ val dataset_f_classification2 =
+ spark.createDataFrame(data_f_classif2).toDF("label", "features", "topFeature")
+
+ val resultDF1 = ANOVATest.test(dataset_f_classification1.toDF, "features", "label", true)
+ val resultDF2 = ANOVATest.test(dataset_f_classification2.toDF, "features", "label", true)
+ val selector = new UnivariateFeatureSelector()
+ .setOutputCol("filtered")
+ .setFeatureType("continuous")
+ .setLabelType("categorical")
+ val indices1 = selector.selectIndicesFromPValues(6, resultDF1, "fdr", 0.25)
+ val indices2 = selector.selectIndicesFromPValues(6, resultDF2, "fdr", 0.25)
+ assert(indices1(0) === 0 && indices1(1) === 1)
+ assert(indices2(0) === 0 && indices2(1) === 1)
+ }
+
+ // Use the following sklearn program to verify the test
+ // scalastyle:off
+ /* import numpy as np
+ from sklearn.feature_selection import SelectFdr, f_regression
+
+ X = np.random.rand(10, 6)
+ w = np.array([5, 5, 0.0, 0, 0, 0]).reshape((-1, 1))
+ y = (X @ w).flatten()
+ print(X)
+ print(y)
+
+ F, p = f_regression(X, y)
+ print('F', F)
+ print('p', p)
+ selected = SelectFdr(f_regression, alpha=0.1).fit(X, y).get_support(True)
+
+ print(selected) */
+
+ /* sklearn result
+ [[5.19537247e-01 4.53144603e-01 2.10190418e-01 9.76237361e-01
+ 9.05792824e-01 9.34081024e-01]
+ [8.68906163e-01 5.49099467e-01 6.73567960e-01 3.94736897e-01
+ 9.98764158e-01 1.14285918e-01]
+ [2.56211244e-01 5.21857152e-01 6.55000402e-01 4.81092256e-01
+ 4.05802734e-02 1.59811005e-01]
+ [9.03076723e-01 1.80316576e-01 8.13131160e-01 6.92327901e-01
+ 4.77693321e-01 2.17284784e-01]
+ [4.75926597e-01 6.80511651e-01 9.55843875e-01 1.52627108e-01
+ 1.72766587e-01 6.45234673e-01]
+ [6.05829005e-01 8.43879811e-01 4.48596383e-01 7.25003439e-01
+ 2.83962640e-02 5.14414827e-01]
+ [8.57631869e-01 1.18279868e-01 2.84428492e-01 8.51544596e-01
+ 1.33220409e-02 1.87044251e-01]
+ [2.43360773e-01 4.83288948e-02 1.10430569e-01 4.33097852e-01
+ 5.63452248e-02 8.24333214e-01]
+ [2.18226531e-01 5.28477779e-01 3.01852956e-01 6.31664822e-04
+ 8.97463990e-01 8.25297034e-01]
+ [6.95170305e-01 7.35775299e-01 4.32188618e-01 2.26744166e-01
+ 5.13186095e-01 2.91635657e-01]]
+ [4.86340925 7.09002815 3.89034198 5.4169665 5.78219124 7.24854408
+ 4.87955868 1.45844834 3.73352155 7.15472802]
+ F [6.79932587 7.09311449 2.25262252 0.02652918 0.40812054 2.14464201]
+ p [0.03124895 0.02865887 0.17178184 0.87465381 0.54077957 0.18122753]
+ [0 1]
+ */
+
+ /* sklearn result
+ [[0.21557113 0.66070242 0.89964323 0.1569332 0.84097522 0.61614986]
+ [0.14790391 0.40356507 0.2973803 0.53051143 0.35408457 0.88180598]
+ [0.39333276 0.42790148 0.41415147 0.82478069 0.57201431 0.49972278]
+ [0.46189165 0.460305 0.21054573 0.16588781 0.72898672 0.41290627]
+ [0.42527082 0.83902909 0.97275171 0.76947383 0.24470714 0.57847281]
+ [0.56185556 0.94463811 0.97741409 0.27233834 0.76460529 0.53085766]
+ [0.5828694 0.45827703 0.49305311 0.13803643 0.18242319 0.14182515]
+ [0.98848811 0.43453809 0.11712213 0.4849829 0.06431555 0.76125387]
+ [0.1181108 0.43820753 0.49576967 0.75729578 0.35355208 0.48165022]
+ [0.44250624 0.24310088 0.03976366 0.24023351 0.91659502 0.75260252]]
+ [4.38136774 2.7573449 4.10617119 4.61098326 6.32149954 7.53246836
+ 5.20573215 7.11513098 2.78159163 3.42803558]
+ F [11.90962327 6.49595546 1.51054886 0.17751367 0.40829523 0.1797005 ]
+ p [0.0086816 0.03424301 0.25397764 0.68461076 0.54069506 0.68279904]
+ [0] */
+ // scalastyle:on
+ test("Test selectIndicesFromPValues f_regression") {
+ val data_f_regression1 = Seq(
+ (4.86340925, Vectors.dense(5.19537247e-01, 4.53144603e-01, 2.10190418e-01, 9.76237361e-01,
+ 9.05792824e-01, 9.34081024e-01), Vectors.dense(5.19537247e-01, 4.53144603e-01)),
+ (7.09002815, Vectors.dense(8.68906163e-01, 5.49099467e-01, 6.73567960e-01, 3.94736897e-01,
+ 9.98764158e-01, 1.14285918e-01), Vectors.dense(8.68906163e-01, 5.49099467e-01)),
+ (3.89034198, Vectors.dense(2.56211244e-01, 5.21857152e-01, 6.55000402e-01, 4.81092256e-01,
+ 4.05802734e-02, 1.59811005e-01), Vectors.dense(2.56211244e-01, 5.21857152e-01)),
+ (5.4169665, Vectors.dense(9.03076723e-01, 1.80316576e-01, 8.13131160e-01, 6.92327901e-01,
+ 4.77693321e-01, 2.17284784e-01), Vectors.dense(9.03076723e-01, 1.80316576e-01)),
+ (5.78219124, Vectors.dense(4.75926597e-01, 6.80511651e-01, 9.55843875e-01, 1.52627108e-01,
+ 1.72766587e-01, 6.45234673e-01), Vectors.dense(4.75926597e-01, 6.80511651e-01)),
+ (7.24854408, Vectors.dense(6.05829005e-01, 8.43879811e-01, 4.48596383e-01, 7.25003439e-01,
+ 2.83962640e-02, 5.14414827e-01), Vectors.dense(6.05829005e-01, 8.43879811e-01)),
+ (4.87955868, Vectors.dense(8.57631869e-01, 1.18279868e-01, 2.84428492e-01, 8.51544596e-01,
+ 1.33220409e-02, 1.87044251e-01), Vectors.dense(8.57631869e-01, 1.18279868e-01)),
+ (1.45844834, Vectors.dense(2.43360773e-01, 4.83288948e-02, 1.10430569e-01, 4.33097852e-01,
+ 5.63452248e-02, 8.24333214e-01), Vectors.dense(2.43360773e-01, 4.83288948e-02)),
+ (3.73352155, Vectors.dense(2.18226531e-01, 5.28477779e-01, 3.01852956e-01, 6.31664822e-04,
+ 8.97463990e-01, 8.25297034e-01), Vectors.dense(2.18226531e-01, 5.28477779e-01)),
+ (7.15472802, Vectors.dense(6.95170305e-01, 7.35775299e-01, 4.32188618e-01, 2.26744166e-01,
+ 5.13186095e-01, 2.91635657e-01), Vectors.dense(6.95170305e-01, 7.35775299e-01)))
+
+ val data_f_regression2 = Seq(
+ (4.38136774, Vectors.dense(0.21557113, 0.66070242, 0.89964323, 0.1569332, 0.84097522,
+ 0.61614986), Vectors.dense(0.21557113)),
+ (2.7573449, Vectors.dense(0.14790391, 0.40356507, 0.2973803, 0.53051143, 0.35408457,
+ 0.88180598), Vectors.dense(0.14790391)),
+ (4.10617119, Vectors.dense(0.39333276, 0.42790148, 0.41415147, 0.82478069, 0.57201431,
+ 0.49972278), Vectors.dense(0.39333276)),
+ (4.61098326, Vectors.dense(0.46189165, 0.460305, 0.21054573, 0.16588781, 0.72898672,
+ 0.41290627), Vectors.dense(0.46189165)),
+ (6.32149954, Vectors.dense(0.42527082, 0.83902909, 0.97275171, 0.76947383, 0.24470714,
+ 0.57847281), Vectors.dense(0.42527082)),
+ (7.53246836, Vectors.dense(0.56185556, 0.94463811, 0.97741409, 0.27233834, 0.76460529,
+ 0.53085766), Vectors.dense(0.56185556)),
+ (5.20573215, Vectors.dense(0.5828694, 0.45827703, 0.49305311, 0.13803643, 0.18242319,
+ 0.14182515), Vectors.dense(0.5828694)),
+ (7.11513098, Vectors.dense(0.98848811, 0.43453809, 0.11712213, 0.4849829, 0.06431555,
+ 0.76125387), Vectors.dense(0.98848811)),
+ (2.78159163, Vectors.dense(0.1181108, 0.43820753, 0.49576967, 0.75729578, 0.35355208,
+ 0.48165022), Vectors.dense(0.1181108)),
+ (3.42803558, Vectors.dense(0.44250624, 0.24310088, 0.03976366, 0.24023351, 0.91659502,
+ 0.75260252), Vectors.dense(0.44250624)))
+
+ val dataset_f_regression1 =
+ spark.createDataFrame(data_f_regression1).toDF("label", "features", "topFeature")
+
+ val dataset_f_regression2 =
+ spark.createDataFrame(data_f_regression2).toDF("label", "features", "topFeature")
+
+ val resultDF1 = FValueTest.test(dataset_f_regression1.toDF, "features", "label", true)
+ val resultDF2 = FValueTest.test(dataset_f_regression2.toDF, "features", "label", true)
+ val selector = new UnivariateFeatureSelector()
+ .setOutputCol("filtered")
+ .setFeatureType("continuous")
+ .setLabelType("continuous")
+ val indices1 = selector.selectIndicesFromPValues(6, resultDF1, "fdr", 0.1)
+ val indices2 = selector.selectIndicesFromPValues(6, resultDF2, "fdr", 0.1)
+ assert(indices1(0) === 1 && indices1(1) === 0)
+ assert(indices2(0) === 0)
+ }
+
+ test("read/write") {
+ def checkModelData(
+ model: UnivariateFeatureSelectorModel,
+ model2: UnivariateFeatureSelectorModel): Unit = {
+ assert(model.selectedFeatures === model2.selectedFeatures)
+ }
+ val selector = new UnivariateFeatureSelector()
+ .setFeatureType("continuous")
+ .setLabelType("categorical")
+ testEstimatorAndModelReadWrite(selector, datasetAnova,
+ UnivariateFeatureSelectorSuite.allParamSettings,
+ UnivariateFeatureSelectorSuite.allParamSettings, checkModelData)
+ }
+
+ private def testSelector(selector: UnivariateFeatureSelector, data: Dataset[_]):
+ UnivariateFeatureSelectorModel = {
+ val selectorModel = selector.fit(data)
+ testTransformer[(Double, Vector, Vector)](data.toDF(), selectorModel,
+ "filtered", "topFeature") {
+ case Row(vec1: Vector, vec2: Vector) =>
+ assert(vec1 ~== vec2 absTol 1e-1)
+ }
+ selectorModel
+ }
+}
+
+object UnivariateFeatureSelectorSuite {
+
+ /**
+ * Mapping from all Params to valid settings which differ from the defaults.
+ * This is useful for tests which need to exercise all Params, such as save/load.
+ * This excludes input columns to simplify some tests.
+ */
+ val allParamSettings: Map[String, Any] = Map(
+ "selectionMode" -> "percentile",
+ "selectionThreshold" -> 0.12,
+ "outputCol" -> "myOutput"
+ )
+}
diff --git a/python/docs/source/reference/pyspark.ml.rst b/python/docs/source/reference/pyspark.ml.rst
index cc90459..7837d60 100644
--- a/python/docs/source/reference/pyspark.ml.rst
+++ b/python/docs/source/reference/pyspark.ml.rst
@@ -61,8 +61,6 @@ Feature
:template: autosummary/class_with_docs.rst
:toctree: api/
- ANOVASelector
- ANOVASelectorModel
Binarizer
BucketedRandomProjectionLSH
BucketedRandomProjectionLSHModel
@@ -74,8 +72,6 @@ Feature
DCT
ElementwiseProduct
FeatureHasher
- FValueSelector
- FValueSelectorModel
HashingTF
IDF
IDFModel
@@ -109,6 +105,8 @@ Feature
StringIndexer
StringIndexerModel
Tokenizer
+ UnivariateFeatureSelector
+ UnivariateFeatureSelectorModel
VarianceThresholdSelector
VarianceThresholdSelectorModel
VectorAssembler
@@ -272,10 +270,8 @@ Statistics
:template: autosummary/class_with_docs.rst
:toctree: api/
- ANOVATest
ChiSquareTest
Correlation
- FValueTest
KolmogorovSmirnovTest
MultivariateGaussian
Summarizer
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index 546c463..f9d22ba 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -24,8 +24,7 @@ from pyspark.ml.util import JavaMLReadable, JavaMLWritable
from pyspark.ml.wrapper import JavaEstimator, JavaModel, JavaParams, JavaTransformer, _jvm
from pyspark.ml.common import inherit_doc
-__all__ = ['ANOVASelector', 'ANOVASelectorModel',
- 'Binarizer',
+__all__ = ['Binarizer',
'BucketedRandomProjectionLSH', 'BucketedRandomProjectionLSHModel',
'Bucketizer',
'ChiSqSelector', 'ChiSqSelectorModel',
@@ -33,7 +32,6 @@ __all__ = ['ANOVASelector', 'ANOVASelectorModel',
'DCT',
'ElementwiseProduct',
'FeatureHasher',
- 'FValueSelector', 'FValueSelectorModel',
'HashingTF',
'IDF', 'IDFModel',
'Imputer', 'ImputerModel',
@@ -56,6 +54,7 @@ __all__ = ['ANOVASelector', 'ANOVASelectorModel',
'StopWordsRemover',
'StringIndexer', 'StringIndexerModel',
'Tokenizer',
+ 'UnivariateFeatureSelector', 'UnivariateFeatureSelectorModel',
'VarianceThresholdSelector', 'VarianceThresholdSelectorModel',
'VectorAssembler',
'VectorIndexer', 'VectorIndexerModel',
@@ -5413,106 +5412,6 @@ class _SelectorModel(JavaModel, _SelectorParams):
@inherit_doc
-class ANOVASelector(_Selector, JavaMLReadable, JavaMLWritable):
- """
- ANOVA F-value Classification selector, which selects continuous features to use for predicting
- a categorical label.
- The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
- `fdr`, `fwe`.
-
- - `numTopFeatures` chooses a fixed number of top features according to a F value
- classification test.
- - `percentile` is similar but chooses a fraction of all features
- instead of a fixed number.
- - `fpr` chooses all features whose p-values are below a threshold,
- thus controlling the false positive rate of selection.
- - `fdr` uses the `Benjamini-Hochberg procedure \
- <https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_
- to choose all features whose false discovery rate is below a threshold.
- - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
- 1 / `numFeatures`, thus controlling the family-wise error rate of selection.
-
- By default, the selection method is `numTopFeatures`, with the default number of top features
- set to 50.
-
- .. versionadded:: 3.1.0
-
- Examples
- --------
- >>> from pyspark.ml.linalg import Vectors
- >>> df = spark.createDataFrame(
- ... [(Vectors.dense([1.7, 4.4, 7.6, 5.8, 9.6, 2.3]), 3.0),
- ... (Vectors.dense([8.8, 7.3, 5.7, 7.3, 2.2, 4.1]), 2.0),
- ... (Vectors.dense([1.2, 9.5, 2.5, 3.1, 8.7, 2.5]), 1.0),
- ... (Vectors.dense([3.7, 9.2, 6.1, 4.1, 7.5, 3.8]), 2.0),
- ... (Vectors.dense([8.9, 5.2, 7.8, 8.3, 5.2, 3.0]), 4.0),
- ... (Vectors.dense([7.9, 8.5, 9.2, 4.0, 9.4, 2.1]), 4.0)],
- ... ["features", "label"])
- >>> selector = ANOVASelector(numTopFeatures=1, outputCol="selectedFeatures")
- >>> model = selector.fit(df)
- >>> model.getFeaturesCol()
- 'features'
- >>> model.setFeaturesCol("features")
- ANOVASelectorModel...
- >>> model.transform(df).head().selectedFeatures
- DenseVector([7.6])
- >>> model.selectedFeatures
- [2]
- >>> anovaSelectorPath = temp_path + "/anova-selector"
- >>> selector.save(anovaSelectorPath)
- >>> loadedSelector = ANOVASelector.load(anovaSelectorPath)
- >>> loadedSelector.getNumTopFeatures() == selector.getNumTopFeatures()
- True
- >>> modelPath = temp_path + "/anova-selector-model"
- >>> model.save(modelPath)
- >>> loadedModel = ANOVASelectorModel.load(modelPath)
- >>> loadedModel.selectedFeatures == model.selectedFeatures
- True
- >>> loadedModel.transform(df).take(1) == model.transform(df).take(1)
- True
- """
-
- @keyword_only
- def __init__(self, *, numTopFeatures=50, featuresCol="features", outputCol=None,
- labelCol="label", selectorType="numTopFeatures", percentile=0.1, fpr=0.05,
- fdr=0.05, fwe=0.05):
- """
- __init__(self, \\*, numTopFeatures=50, featuresCol="features", outputCol=None, \
- labelCol="label", selectorType="numTopFeatures", percentile=0.1, fpr=0.05, \
- fdr=0.05, fwe=0.05)
- """
- super(ANOVASelector, self).__init__()
- self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.ANOVASelector", self.uid)
- kwargs = self._input_kwargs
- self.setParams(**kwargs)
-
- @keyword_only
- @since("3.1.0")
- def setParams(self, *, numTopFeatures=50, featuresCol="features", outputCol=None,
- labelCol="labels", selectorType="numTopFeatures", percentile=0.1, fpr=0.05,
- fdr=0.05, fwe=0.05):
- """
- setParams(self, \\*, numTopFeatures=50, featuresCol="features", outputCol=None, \
- labelCol="labels", selectorType="numTopFeatures", percentile=0.1, fpr=0.05, \
- fdr=0.05, fwe=0.05)
- Sets params for this ANOVASelector.
- """
- kwargs = self._input_kwargs
- return self._set(**kwargs)
-
- def _create_model(self, java_model):
- return ANOVASelectorModel(java_model)
-
-
-class ANOVASelectorModel(_SelectorModel, JavaMLReadable, JavaMLWritable):
- """
- Model fitted by :py:class:`ANOVASelector`.
-
- .. versionadded:: 3.1.0
- """
-
-
-@inherit_doc
class ChiSqSelector(_Selector, JavaMLReadable, JavaMLWritable):
"""
Chi-Squared feature selection, which selects categorical features to use for predicting a
@@ -5538,6 +5437,9 @@ class ChiSqSelector(_Selector, JavaMLReadable, JavaMLWritable):
By default, the selection method is `numTopFeatures`, with the default number of top features
set to 50.
+ .. deprecated:: 3.1.0
+ Use UnivariateFeatureSelector
+
.. versionadded:: 2.0.0
Examples
@@ -5613,110 +5515,6 @@ class ChiSqSelectorModel(_SelectorModel, JavaMLReadable, JavaMLWritable):
@inherit_doc
-class FValueSelector(_Selector, JavaMLReadable, JavaMLWritable):
- """
- F Value Regression feature selector, which selects continuous features to use for predicting a
- continuous label.
- The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
- `fdr`, `fwe`.
-
- * `numTopFeatures` chooses a fixed number of top features according to a F value
- regression test.
-
- * `percentile` is similar but chooses a fraction of all features
- instead of a fixed number.
-
- * `fpr` chooses all features whose p-values are below a threshold,
- thus controlling the false positive rate of selection.
-
- * `fdr` uses the `Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/
- False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_
- to choose all features whose false discovery rate is below a threshold.
-
- * `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
- 1/numFeatures, thus controlling the family-wise error rate of selection.
-
- By default, the selection method is `numTopFeatures`, with the default number of top features
- set to 50.
-
- .. versionadded:: 3.1.0
-
- Examples
- --------
- >>> from pyspark.ml.linalg import Vectors
- >>> df = spark.createDataFrame(
- ... [(Vectors.dense([6.0, 7.0, 0.0, 7.0, 6.0, 0.0]), 4.6),
- ... (Vectors.dense([0.0, 9.0, 6.0, 0.0, 5.0, 9.0]), 6.6),
- ... (Vectors.dense([0.0, 9.0, 3.0, 0.0, 5.0, 5.0]), 5.1),
- ... (Vectors.dense([0.0, 9.0, 8.0, 5.0, 6.0, 4.0]), 7.6),
- ... (Vectors.dense([8.0, 9.0, 6.0, 5.0, 4.0, 4.0]), 9.0),
- ... (Vectors.dense([8.0, 9.0, 6.0, 4.0, 0.0, 0.0]), 9.0)],
- ... ["features", "label"])
- >>> selector = FValueSelector(numTopFeatures=1, outputCol="selectedFeatures")
- >>> model = selector.fit(df)
- >>> model.getFeaturesCol()
- 'features'
- >>> model.setFeaturesCol("features")
- FValueSelectorModel...
- >>> model.transform(df).head().selectedFeatures
- DenseVector([0.0])
- >>> model.selectedFeatures
- [2]
- >>> fvalueSelectorPath = temp_path + "/fvalue-selector"
- >>> selector.save(fvalueSelectorPath)
- >>> loadedSelector = FValueSelector.load(fvalueSelectorPath)
- >>> loadedSelector.getNumTopFeatures() == selector.getNumTopFeatures()
- True
- >>> modelPath = temp_path + "/fvalue-selector-model"
- >>> model.save(modelPath)
- >>> loadedModel = FValueSelectorModel.load(modelPath)
- >>> loadedModel.selectedFeatures == model.selectedFeatures
- True
- >>> loadedModel.transform(df).take(1) == model.transform(df).take(1)
- True
- """
-
- @keyword_only
- def __init__(self, *, numTopFeatures=50, featuresCol="features", outputCol=None,
- labelCol="label", selectorType="numTopFeatures", percentile=0.1, fpr=0.05,
- fdr=0.05, fwe=0.05):
- """
- __init__(self, \\*, numTopFeatures=50, featuresCol="features", outputCol=None, \
- labelCol="label", selectorType="numTopFeatures", percentile=0.1, fpr=0.05, \
- fdr=0.05, fwe=0.05)
- """
- super(FValueSelector, self).__init__()
- self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.FValueSelector", self.uid)
- kwargs = self._input_kwargs
- self.setParams(**kwargs)
-
- @keyword_only
- @since("3.1.0")
- def setParams(self, *, numTopFeatures=50, featuresCol="features", outputCol=None,
- labelCol="labels", selectorType="numTopFeatures", percentile=0.1, fpr=0.05,
- fdr=0.05, fwe=0.05):
- """
- setParams(self, \\*, numTopFeatures=50, featuresCol="features", outputCol=None, \
- labelCol="labels", selectorType="numTopFeatures", percentile=0.1, fpr=0.05, \
- fdr=0.05, fwe=0.05)
- Sets params for this FValueSelector.
- """
- kwargs = self._input_kwargs
- return self._set(**kwargs)
-
- def _create_model(self, java_model):
- return FValueSelectorModel(java_model)
-
-
-class FValueSelectorModel(_SelectorModel, JavaMLReadable, JavaMLWritable):
- """
- Model fitted by :py:class:`FValueSelector`.
-
- .. versionadded:: 3.1.0
- """
-
-
-@inherit_doc
class VectorSizeHint(JavaTransformer, HasInputCol, HasHandleInvalid, JavaMLReadable,
JavaMLWritable):
"""
@@ -5952,6 +5750,243 @@ class VarianceThresholdSelectorModel(JavaModel, _VarianceThresholdSelectorParams
return self._call_java("selectedFeatures")
+class _UnivariateFeatureSelectorParams(HasFeaturesCol, HasOutputCol, HasLabelCol):
+ """
+ Params for :py:class:`UnivariateFeatureSelector` and
+ :py:class:`UnivariateFeatureSelectorModel`.
+
+ .. versionadded:: 3.1.1
+ """
+
+ featureType = Param(Params._dummy(), "featureType",
+ "The feature type. " +
+ "Supported options: categorical, continuous.",
+ typeConverter=TypeConverters.toString)
+
+ labelType = Param(Params._dummy(), "labelType",
+ "The label type. " +
+ "Supported options: categorical, continuous.",
+ typeConverter=TypeConverters.toString)
+
+ selectionMode = Param(Params._dummy(), "selectionMode",
+ "The selection mode. " +
+ "Supported options: numTopFeatures (default), percentile, fpr, " +
+ "fdr, fwe.",
+ typeConverter=TypeConverters.toString)
+
+ selectionThreshold = Param(Params._dummy(), "selectionThreshold", "The upper bound of the " +
+ "features that the selector will select.",
+ typeConverter=TypeConverters.toFloat)
+
+ def __init__(self, *args):
+ super(_UnivariateFeatureSelectorParams, self).__init__(*args)
+ self._setDefault(selectionMode="numTopFeatures")
+
+ @since("3.1.1")
+ def getFeatureType(self):
+ """
+ Gets the value of featureType or its default value.
+ """
+ return self.getOrDefault(self.featureType)
+
+ @since("3.1.1")
+ def getLabelType(self):
+ """
+ Gets the value of labelType or its default value.
+ """
+ return self.getOrDefault(self.labelType)
+
+ @since("3.1.1")
+ def getSelectionMode(self):
+ """
+ Gets the value of selectionMode or its default value.
+ """
+ return self.getOrDefault(self.selectionMode)
+
+ @since("3.1.1")
+ def getSelectionThreshold(self):
+ """
+ Gets the value of selectionThreshold or its default value.
+ """
+ return self.getOrDefault(self.selectionThreshold)
+
+
+@inherit_doc
+class UnivariateFeatureSelector(JavaEstimator, _UnivariateFeatureSelectorParams, JavaMLReadable,
+ JavaMLWritable):
+ """
+ UnivariateFeatureSelector
+ The user can set `featureType` and `labelType`, and Spark will pick the score function based on
+ the specified `featureType` and `labelType`.
+
+ The following combinations of `featureType` and `labelType` are supported:
+
+ - `featureType` `categorical` and `labelType` `categorical`, Spark uses chi2.
+ - `featureType` `continuous` and `labelType` `categorical`, Spark uses f_classif.
+ - `featureType` `continuous` and `labelType` `continuous`, Spark uses f_regression.
+
+ The `UnivariateFeatureSelector` supports different selection modes: `numTopFeatures`,
+ `percentile`, `fpr`, `fdr`, `fwe`.
+
+ - `numTopFeatures` chooses a fixed number of top features according to a
+ hypothesis test.
+ - `percentile` is similar but chooses a fraction of all features
+ instead of a fixed number.
+ - `fpr` chooses all features whose p-values are below a threshold,
+ thus controlling the false positive rate of selection.
+ - `fdr` uses the `Benjamini-Hochberg procedure \
+ <https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_
+ to choose all features whose false discovery rate is below a threshold.
+ - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+ 1 / `numFeatures`, thus controlling the family-wise error rate of selection.
+
+ By default, the selection mode is `numTopFeatures`.
+
+ .. versionadded:: 3.1.1
+
+ Examples
+ --------
+ >>> from pyspark.ml.linalg import Vectors
+ >>> df = spark.createDataFrame(
+ ... [(Vectors.dense([1.7, 4.4, 7.6, 5.8, 9.6, 2.3]), 3.0),
+ ... (Vectors.dense([8.8, 7.3, 5.7, 7.3, 2.2, 4.1]), 2.0),
+ ... (Vectors.dense([1.2, 9.5, 2.5, 3.1, 8.7, 2.5]), 1.0),
+ ... (Vectors.dense([3.7, 9.2, 6.1, 4.1, 7.5, 3.8]), 2.0),
+ ... (Vectors.dense([8.9, 5.2, 7.8, 8.3, 5.2, 3.0]), 4.0),
+ ... (Vectors.dense([7.9, 8.5, 9.2, 4.0, 9.4, 2.1]), 4.0)],
+ ... ["features", "label"])
+ >>> selector = UnivariateFeatureSelector(outputCol="selectedFeatures")
+ >>> selector.setFeatureType("continuous").setLabelType("categorical").setSelectionThreshold(1)
+ UnivariateFeatureSelector...
+ >>> model = selector.fit(df)
+ >>> model.getFeaturesCol()
+ 'features'
+ >>> model.setFeaturesCol("features")
+ UnivariateFeatureSelectorModel...
+ >>> model.transform(df).head().selectedFeatures
+ DenseVector([7.6])
+ >>> model.selectedFeatures
+ [2]
+ >>> selectorPath = temp_path + "/selector"
+ >>> selector.save(selectorPath)
+ >>> loadedSelector = UnivariateFeatureSelector.load(selectorPath)
+ >>> loadedSelector.getSelectionThreshold() == selector.getSelectionThreshold()
+ True
+ >>> modelPath = temp_path + "/selector-model"
+ >>> model.save(modelPath)
+ >>> loadedModel = UnivariateFeatureSelectorModel.load(modelPath)
+ >>> loadedModel.selectedFeatures == model.selectedFeatures
+ True
+ >>> loadedModel.transform(df).take(1) == model.transform(df).take(1)
+ True
+ """
+
+ @keyword_only
+ def __init__(self, *, featuresCol="features", outputCol=None,
+ labelCol="label", selectionMode="numTopFeatures"):
+ """
+ __init__(self, \\*, featuresCol="features", outputCol=None, \
+ labelCol="label", selectionMode="numTopFeatures")
+ """
+ super(UnivariateFeatureSelector, self).__init__()
+ self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.UnivariateFeatureSelector",
+ self.uid)
+ kwargs = self._input_kwargs
+ self.setParams(**kwargs)
+
+ @keyword_only
+ @since("3.1.1")
+ def setParams(self, *, featuresCol="features", outputCol=None,
+ labelCol="label", selectionMode="numTopFeatures"):
+ """
+ setParams(self, \\*, featuresCol="features", outputCol=None, \
+ labelCol="label", selectionMode="numTopFeatures")
+ Sets params for this UnivariateFeatureSelector.
+ """
+ kwargs = self._input_kwargs
+ return self._set(**kwargs)
+
+ @since("3.1.1")
+ def setFeatureType(self, value):
+ """
+ Sets the value of :py:attr:`featureType`.
+ """
+ return self._set(featureType=value)
+
+ @since("3.1.1")
+ def setLabelType(self, value):
+ """
+ Sets the value of :py:attr:`labelType`.
+ """
+ return self._set(labelType=value)
+
+ @since("3.1.1")
+ def setSelectionMode(self, value):
+ """
+ Sets the value of :py:attr:`selectionMode`.
+ """
+ return self._set(selectionMode=value)
+
+ @since("3.1.1")
+ def setSelectionThreshold(self, value):
+ """
+ Sets the value of :py:attr:`selectionThreshold`.
+ """
+ return self._set(selectionThreshold=value)
+
+ def setFeaturesCol(self, value):
+ """
+ Sets the value of :py:attr:`featuresCol`.
+ """
+ return self._set(featuresCol=value)
+
+ def setOutputCol(self, value):
+ """
+ Sets the value of :py:attr:`outputCol`.
+ """
+ return self._set(outputCol=value)
+
+ def setLabelCol(self, value):
+ """
+ Sets the value of :py:attr:`labelCol`.
+ """
+ return self._set(labelCol=value)
+
+ def _create_model(self, java_model):
+ return UnivariateFeatureSelectorModel(java_model)
+
+
+class UnivariateFeatureSelectorModel(JavaModel, _UnivariateFeatureSelectorParams, JavaMLReadable,
+ JavaMLWritable):
+ """
+ Model fitted by :py:class:`UnivariateFeatureSelector`.
+
+ .. versionadded:: 3.1.1
+ """
+
+ @since("3.1.1")
+ def setFeaturesCol(self, value):
+ """
+ Sets the value of :py:attr:`featuresCol`.
+ """
+ return self._set(featuresCol=value)
+
+ @since("3.1.1")
+ def setOutputCol(self, value):
+ """
+ Sets the value of :py:attr:`outputCol`.
+ """
+ return self._set(outputCol=value)
+
+ @property
+ @since("3.1.1")
+ def selectedFeatures(self):
+ """
+ List of indices to select (filter).
+ """
+ return self._call_java("selectedFeatures")
+
+
if __name__ == "__main__":
import doctest
import sys
diff --git a/python/pyspark/ml/feature.pyi b/python/pyspark/ml/feature.pyi
index 4999def..33e4691 100644
--- a/python/pyspark/ml/feature.pyi
+++ b/python/pyspark/ml/feature.pyi
@@ -1456,38 +1456,6 @@ class _SelectorModel(JavaModel, _SelectorParams):
@property
def selectedFeatures(self) -> List[int]: ...
-class ANOVASelector(
- _Selector[ANOVASelectorModel], JavaMLReadable[ANOVASelector], JavaMLWritable
-):
- def __init__(
- self,
- numTopFeatures: int = ...,
- featuresCol: str = ...,
- outputCol: Optional[str] = ...,
- labelCol: str = ...,
- selectorType: str = ...,
- percentile: float = ...,
- fpr: float = ...,
- fdr: float = ...,
- fwe: float = ...,
- ) -> None: ...
- def setParams(
- self,
- numTopFeatures: int = ...,
- featuresCol: str = ...,
- outputCol: Optional[str] = ...,
- labelCol: str = ...,
- selectorType: str = ...,
- percentile: float = ...,
- fpr: float = ...,
- fdr: float = ...,
- fwe: float = ...,
- ) -> ANOVASelector: ...
-
-class ANOVASelectorModel(
- _SelectorModel, JavaMLReadable[ANOVASelectorModel], JavaMLWritable
-): ...
-
class ChiSqSelector(
_Selector[ChiSqSelectorModel],
JavaMLReadable[ChiSqSelector],
@@ -1565,38 +1533,6 @@ class VectorSizeHint(
def setInputCol(self, value: str) -> VectorSizeHint: ...
def setHandleInvalid(self, value: str) -> VectorSizeHint: ...
-class FValueSelector(
- _Selector[FValueSelectorModel], JavaMLReadable[FValueSelector], JavaMLWritable
-):
- def __init__(
- self,
- numTopFeatures: int = ...,
- featuresCol: str = ...,
- outputCol: Optional[str] = ...,
- labelCol: str = ...,
- selectorType: str = ...,
- percentile: float = ...,
- fpr: float = ...,
- fdr: float = ...,
- fwe: float = ...,
- ) -> None: ...
- def setParams(
- self,
- numTopFeatures: int = ...,
- featuresCol: str = ...,
- outputCol: Optional[str] = ...,
- labelCol: str = ...,
- selectorType: str = ...,
- percentile: float = ...,
- fpr: float = ...,
- fdr: float = ...,
- fwe: float = ...,
- ) -> FValueSelector: ...
-
-class FValueSelectorModel(
- _SelectorModel, JavaMLReadable[FValueSelectorModel], JavaMLWritable
-): ...
-
class _VarianceThresholdSelectorParams(HasFeaturesCol, HasOutputCol):
varianceThreshold: Param[float] = ...
def getVarianceThreshold(self) -> float: ...
@@ -1633,3 +1569,55 @@ class VarianceThresholdSelectorModel(
def setOutputCol(self, value: str) -> VarianceThresholdSelectorModel: ...
@property
def selectedFeatures(self) -> List[int]: ...
+
+class _UnivariateFeatureSelectorParams(HasFeaturesCol, HasOutputCol, HasLabelCol):
+ featureType: Param[str] = ...
+ labelType: Param[str] = ...
+ selectionMode: Param[str] = ...
+ selectionThreshold: Param[float] = ...
+ def __init__(self, *args: Any): ...
+ def getFeatureType(self) -> str: ...
+ def getLabelType(self) -> str: ...
+ def getSelectionMode(self) -> str: ...
+ def getSelectionThreshold(self) -> float: ...
+
+class UnivariateFeatureSelector(
+ JavaEstimator[UnivariateFeatureSelectorModel],
+ _UnivariateFeatureSelectorParams,
+ JavaMLReadable[UnivariateFeatureSelector],
+ JavaMLWritable,
+):
+ def __init__(
+ self,
+ *,
+ featuresCol: str = ...,
+ outputCol: Optional[str] = ...,
+ labelCol: str = ...,
+ selectionMode: str = ...,
+ ) -> None: ...
+ def setParams(
+ self,
+ *,
+ featuresCol: str = ...,
+ outputCol: Optional[str] = ...,
+ labelCol: str = ...,
+ selectionMode: str = ...,
+ ) -> UnivariateFeatureSelector: ...
+ def setFeatureType(self, value: str) -> UnivariateFeatureSelector: ...
+ def setLabelType(self, value: str) -> UnivariateFeatureSelector: ...
+ def setSelectionMode(self, value: str) -> UnivariateFeatureSelector: ...
+ def setSelectionThreshold(self, value: float) -> UnivariateFeatureSelector: ...
+ def setFeaturesCol(self, value: str) -> UnivariateFeatureSelector: ...
+ def setOutputCol(self, value: str) -> UnivariateFeatureSelector: ...
+ def setLabelCol(self, value: str) -> UnivariateFeatureSelector: ...
+
+class UnivariateFeatureSelectorModel(
+ JavaModel,
+ _UnivariateFeatureSelectorParams,
+ JavaMLReadable[UnivariateFeatureSelectorModel],
+ JavaMLWritable,
+):
+ def setFeaturesCol(self, value: str) -> UnivariateFeatureSelectorModel: ...
+ def setOutputCol(self, value: str) -> UnivariateFeatureSelectorModel: ...
+ @property
+ def selectedFeatures(self) -> List[int]: ...
diff --git a/python/pyspark/ml/stat.py b/python/pyspark/ml/stat.py
index 4388de1..60eeb68 100644
--- a/python/pyspark/ml/stat.py
+++ b/python/pyspark/ml/stat.py
@@ -467,154 +467,6 @@ class MultivariateGaussian(object):
self.cov = cov
-class ANOVATest(object):
- """
- Conduct ANOVA Classification Test for continuous features against categorical labels.
-
- .. versionadded:: 3.1.0
- """
- @staticmethod
- def test(dataset, featuresCol, labelCol, flatten=False):
- """
- Perform an ANOVA test using dataset.
-
- .. versionadded:: 3.1.0
-
- Parameters
- ----------
- dataset : :py:class:`pyspark.sql.DataFrame`
- DataFrame of categorical labels and continuous features.
- featuresCol : str
- Name of features column in dataset, of type `Vector` (`VectorUDT`).
- labelCol : str
- Name of label column in dataset, of any numerical type.
- flatten : bool, optional
- if True, flattens the returned dataframe.
-
- Returns
- -------
- :py:class:`pyspark.sql.DataFrame`
- DataFrame containing the test result for every feature against the label.
- If flatten is True, this DataFrame will contain one row per feature with the following
- fields:
-
- - `featureIndex: int`
- - `pValue: float`
- - `degreesOfFreedom: int`
- - `fValue: float`
-
- If flatten is False, this DataFrame will contain a single Row with the following fields:
-
- - `pValues: Vector`
- - `degreesOfFreedom: Array[int]`
- - `fValues: Vector`
-
- Each of these fields has one value per feature.
-
- Examples
- --------
- >>> from pyspark.ml.linalg import Vectors
- >>> from pyspark.ml.stat import ANOVATest
- >>> dataset = [[2.0, Vectors.dense([0.43486404, 0.57153633, 0.43175686,
- ... 0.51418671, 0.61632374, 0.96565515])],
- ... [1.0, Vectors.dense([0.49162732, 0.6785187, 0.85460572,
- ... 0.59784822, 0.12394819, 0.53783355])],
- ... [2.0, Vectors.dense([0.30879653, 0.54904515, 0.17103889,
- ... 0.40492506, 0.18957493, 0.5440016])],
- ... [3.0, Vectors.dense([0.68114391, 0.60549825, 0.69094651,
- ... 0.62102109, 0.05471483, 0.96449167])]]
- >>> dataset = spark.createDataFrame(dataset, ["label", "features"])
- >>> anovaResult = ANOVATest.test(dataset, 'features', 'label')
- >>> row = anovaResult.select("fValues", "pValues").collect()
- >>> row[0].fValues
- DenseVector([4.0264, 18.4713, 3.4659, 1.9042, 0.5532, 0.512])
- >>> row[0].pValues
- DenseVector([0.3324, 0.1623, 0.3551, 0.456, 0.689, 0.7029])
- >>> anovaResult = ANOVATest.test(dataset, 'features', 'label', True)
- >>> row = anovaResult.orderBy("featureIndex").collect()
- >>> row[0].fValue
- 4.026438671875297
- """
- sc = SparkContext._active_spark_context
- javaTestObj = _jvm().org.apache.spark.ml.stat.ANOVATest
- args = [_py2java(sc, arg) for arg in (dataset, featuresCol, labelCol, flatten)]
- return _java2py(sc, javaTestObj.test(*args))
-
-
-class FValueTest(object):
- """
- Conduct F Regression test for continuous features against continuous labels.
-
- .. versionadded:: 3.1.0
- """
- @staticmethod
- def test(dataset, featuresCol, labelCol, flatten=False):
- """
- Perform a F Regression test using dataset.
-
- .. versionadded:: 3.1.0
-
- Parameters
- ----------
- dataset : :py:class:`pyspark.sql.DataFrame`
- DataFrame of continuous labels and continuous features.
- featuresCol : str
- Name of features column in dataset, of type `Vector` (`VectorUDT`).
- labelCol : str
- Name of label column in dataset, of any numerical type.
- flatten : bool, optional
- if True, flattens the returned dataframe.
-
- Returns
- -------
- :py:class:`pyspark.sql.DataFrame`
- DataFrame containing the test result for every feature against the label.
- If flatten is True, this DataFrame will contain one row per feature with the following
- fields:
-
- - `featureIndex: int`
- - `pValue: float`
- - `degreesOfFreedom: int`
- - `fValue: float`
-
- If flatten is False, this DataFrame will contain a single Row with the following fields:
-
- - `pValues: Vector`
- - `degreesOfFreedom: Array[int]`
- - `fValues: Vector`
-
- Each of these fields has one value per feature.
-
- Examples
- --------
- >>> from pyspark.ml.linalg import Vectors
- >>> from pyspark.ml.stat import FValueTest
- >>> dataset = [[0.57495218, Vectors.dense([0.43486404, 0.57153633, 0.43175686,
- ... 0.51418671, 0.61632374, 0.96565515])],
- ... [0.84619853, Vectors.dense([0.49162732, 0.6785187, 0.85460572,
- ... 0.59784822, 0.12394819, 0.53783355])],
- ... [0.39777647, Vectors.dense([0.30879653, 0.54904515, 0.17103889,
- ... 0.40492506, 0.18957493, 0.5440016])],
- ... [0.79201573, Vectors.dense([0.68114391, 0.60549825, 0.69094651,
- ... 0.62102109, 0.05471483, 0.96449167])]]
- >>> dataset = spark.createDataFrame(dataset, ["label", "features"])
- >>> fValueResult = FValueTest.test(dataset, 'features', 'label')
- >>> row = fValueResult.select("fValues", "pValues").collect()
- >>> row[0].fValues
- DenseVector([3.741, 7.5807, 142.0684, 34.9849, 0.4112, 0.0539])
- >>> row[0].pValues
- DenseVector([0.1928, 0.1105, 0.007, 0.0274, 0.5871, 0.838])
- >>> fValueResult = FValueTest.test(dataset, 'features', 'label', True)
- >>> row = fValueResult.orderBy("featureIndex").collect()
- >>> row[0].fValue
- 3.7409548308350593
- """
- sc = SparkContext._active_spark_context
- javaTestObj = _jvm().org.apache.spark.ml.stat.FValueTest
- args = [_py2java(sc, arg) for arg in (dataset, featuresCol, labelCol, flatten)]
- return _java2py(sc, javaTestObj.test(*args))
-
-
if __name__ == "__main__":
import doctest
import numpy
diff --git a/python/pyspark/ml/stat.pyi b/python/pyspark/ml/stat.pyi
index 83b0f7e..30485a7 100644
--- a/python/pyspark/ml/stat.pyi
+++ b/python/pyspark/ml/stat.pyi
@@ -75,15 +75,3 @@ class MultivariateGaussian:
mean: Vector
cov: Matrix
def __init__(self, mean: Vector, cov: Matrix) -> None: ...
-
-class ANOVATest:
- @staticmethod
- def test(
- dataset: DataFrame, featuresCol: str, labelCol: str, flatten: bool = ...
- ) -> DataFrame: ...
-
-class FValueTest:
- @staticmethod
- def test(
- dataset: DataFrame, featuresCol: str, labelCol: str, flatten: bool = ...
- ) -> DataFrame: ...
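
Note on the selection modes documented in the UnivariateFeatureSelector
docstring (python/pyspark/ml/feature.py above): the five modes can be
sketched in plain Python as follows. This is a minimal sketch and not Spark
code. The helper name select_indices and its arguments are hypothetical, it
takes a plain list of per-feature p-values instead of a DataFrame, and
returning indices ordered by ascending p-value is an assumption made for
illustration.

```python
def select_indices(p_values, mode, threshold):
    """Return the feature indices kept under the given selection mode.

    Illustrative helper only; it is not part of the Spark API.
    """
    n = len(p_values)
    # Rank features by ascending p-value, breaking ties by feature index.
    ranked = sorted(range(n), key=lambda i: (p_values[i], i))

    if mode == "numTopFeatures":
        # Keep a fixed number of the best-ranked features.
        return ranked[:int(threshold)]
    if mode == "percentile":
        # Keep a fraction of all features.
        return ranked[:int(n * threshold)]
    if mode == "fpr":
        # Keep every feature whose p-value is below the threshold.
        return [i for i in ranked if p_values[i] < threshold]
    if mode == "fwe":
        # Same as fpr, with the threshold scaled by 1 / numFeatures.
        return [i for i in ranked if p_values[i] < threshold / n]
    if mode == "fdr":
        # Benjamini-Hochberg: find the largest 1-based rank k such that
        # p_(k) <= k * threshold / n, then keep the k smallest p-values.
        k = 0
        for rank, i in enumerate(ranked, start=1):
            if p_values[i] <= rank * threshold / n:
                k = rank
        return ranked[:k]
    raise ValueError(f"unsupported selection mode: {mode}")
```

With p-values [0.01, 0.04, 0.03, 0.2, 0.5, 0.002] and a threshold of 0.05,
fpr keeps four features, fdr keeps two, and fwe keeps one, which shows how
the three p-value based modes become progressively stricter.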