You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/03/11 23:45:16 UTC

[GitHub] [spark] huaxingao opened a new pull request #27882: [SPARK-31127][ML] Add abstract Selector

huaxingao opened a new pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882
 
 
   
   ### What changes were proposed in this pull request?
   Add abstract ```Selector```. Put the common code between ```ChiSqSelector``` and ```FValueSelector``` to ```Selector```.
   
   
   ### Why are the changes needed?
   Code reuse.
   
   
   ### Does this PR introduce any user-facing change?
   Yes.
   Add ```pValues``` and ```statistics``` to ```SelectorModel```
   
   
   ### How was this patch tested?
   New and existing tests
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599481156
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599380135
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24572/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-597975130
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119685/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599380135
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24572/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599390032
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599376249
 
 
   **[Test build #119834 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119834/testReport)** for PR 27882 at commit [`eb841a5`](https://github.com/apache/spark/commit/eb841a5148a56a4e814148898408704f40dadcba).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599278543
 
 
   **[Test build #119821 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119821/testReport)** for PR 27882 at commit [`b4d30f9`](https://github.com/apache/spark/commit/b4d30f97a0d4bd40bc689f7d1fd961b8ce66d123).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599350499
 
 
   **[Test build #119834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119834/testReport)** for PR 27882 at commit [`eb841a5`](https://github.com/apache/spark/commit/eb841a5148a56a4e814148898408704f40dadcba).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-602111427
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120135/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599380129
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-598515062
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599472737
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391396804
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/FValueSelector.scala
 ##########
 @@ -154,106 +46,75 @@ private[feature] trait FValueSelectorParams extends Params
  * set to 50.
  */
 @Since("3.1.0")
-final class FValueSelector @Since("3.1.0") (override val uid: String)
-  extends Estimator[FValueSelectorModel] with FValueSelectorParams
-    with DefaultParamsWritable {
+final class FValueSelector @Since("3.1.0") (@Since("3.1.0") override val uid: String) extends
+  Selector[FValueSelectorModel] {
 
   @Since("3.1.0")
   def this() = this(Identifiable.randomUID("FValueSelector"))
 
   /** @group setParam */
   @Since("3.1.0")
-  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
+  override def setNumTopFeatures(value: Int): this.type = super.setNumTopFeatures(value)
 
   /** @group setParam */
   @Since("3.1.0")
-  def setPercentile(value: Double): this.type = set(percentile, value)
+  override def setPercentile(value: Double): this.type = super.setPercentile(value)
 
   /** @group setParam */
   @Since("3.1.0")
-  def setFpr(value: Double): this.type = set(fpr, value)
+  override def setFpr(value: Double): this.type = super.setFpr(value)
 
   /** @group setParam */
   @Since("3.1.0")
-  def setFdr(value: Double): this.type = set(fdr, value)
+  override def setFdr(value: Double): this.type = super.setFdr(value)
 
   /** @group setParam */
   @Since("3.1.0")
-  def setFwe(value: Double): this.type = set(fwe, value)
+  override def setFwe(value: Double): this.type = super.setFwe(value)
 
   /** @group setParam */
   @Since("3.1.0")
-  def setSelectorType(value: String): this.type = set(selectorType, value)
+  override def setSelectorType(value: String): this.type = super.setSelectorType(value)
 
   /** @group setParam */
   @Since("3.1.0")
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+  override def setFeaturesCol(value: String): this.type = super.setFeaturesCol(value)
 
   /** @group setParam */
   @Since("3.1.0")
-  def setOutputCol(value: String): this.type = set(outputCol, value)
+  override def setOutputCol(value: String): this.type = super.setOutputCol(value)
 
   /** @group setParam */
   @Since("3.1.0")
-  def setLabelCol(value: String): this.type = set(labelCol, value)
+  override def setLabelCol(value: String): this.type = super.setLabelCol(value)
 
+  /**
+   * get the SelectionTestResult for every feature against the label
+   */
   @Since("3.1.0")
-  override def fit(dataset: Dataset[_]): FValueSelectorModel = {
-    transformSchema(dataset.schema, logging = true)
-    dataset.select(col($(labelCol)).cast(DoubleType), col($(featuresCol))).rdd.map {
-      case Row(label: Double, features: Vector) =>
-        LabeledPoint(label, features)
-    }
-
-    val testResult = FValueTest.testRegression(dataset, getFeaturesCol, getLabelCol)
-      .zipWithIndex
-    val features = $(selectorType) match {
-      case "numTopFeatures" =>
-        testResult
-          .sortBy { case (res, _) => res.pValue }
-          .take(getNumTopFeatures)
-      case "percentile" =>
-        testResult
-          .sortBy { case (res, _) => res.pValue }
-          .take((testResult.length * getPercentile).toInt)
-      case "fpr" =>
-        testResult
-          .filter { case (res, _) => res.pValue < getFpr }
-      case "fdr" =>
-        // This uses the Benjamini-Hochberg procedure.
-        // https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure
-        val tempRes = testResult
-          .sortBy { case (res, _) => res.pValue }
-        val selected = tempRes
-          .zipWithIndex
-          .filter { case ((res, _), index) =>
-            res.pValue <= getFdr * (index + 1) / testResult.length }
-        if (selected.isEmpty) {
-          Array.empty[(SelectionTestResult, Int)]
-        } else {
-          val maxIndex = selected.map(_._2).max
-          tempRes.take(maxIndex + 1)
-        }
-      case "fwe" =>
-        testResult
-          .filter { case (res, _) => res.pValue < getFwe / testResult.length }
-      case errorType =>
-        throw new IllegalStateException(s"Unknown Selector Type: $errorType")
-    }
-    val indices = features.map { case (_, index) => index }
-    copyValues(new FValueSelectorModel(uid, indices.sorted)
-      .setParent(this))
+  protected[this] override def getSelectionTestResult(dataset: Dataset[_]):
 
 Review comment:
   ditto

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599278715
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-597974692
 
 
   **[Test build #119685 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119685/testReport)** for PR 27882 at commit [`456e688`](https://github.com/apache/spark/commit/456e6883cb280f360bc0b556082761a97e8985ee).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `final class FValueSelector @Since(\"3.1.0\") (@Since(\"3.1.0\") override val uid: String) extends`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599376452
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599302165
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119821/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599278719
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24551/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599350499
 
 
   **[Test build #119834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119834/testReport)** for PR 27882 at commit [`eb841a5`](https://github.com/apache/spark/commit/eb841a5148a56a4e814148898408704f40dadcba).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
huaxingao commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-602155620
 
 
   Please see https://github.com/apache/spark/pull/27978

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599392412
 
 
   **[Test build #119851 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119851/testReport)** for PR 27882 at commit [`4cee276`](https://github.com/apache/spark/commit/4cee27668e3025d455248da7541663c5d86d7fc7).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599278715
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391399664
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/stat/ChiSquareTest.scala
 ##########
 @@ -77,31 +75,4 @@ object ChiSquareTest {
     val statistics: Vector = Vectors.dense(testResults.map(_.statistic))
     spark.createDataFrame(Seq(ChiSquareResult(pValues, degreesOfFreedom, statistics)))
   }
-
-  /**
-   * @param dataset  DataFrame of categorical labels and categorical features.
-   *                 Real-valued features will be treated as categorical for each distinct value.
-   * @param featuresCol  Name of features column in dataset, of type `Vector` (`VectorUDT`)
-   * @param labelCol  Name of label column in dataset, of any numerical type
-   * @return Array containing the SelectionTestResult for every feature against the label.
-   */
-  @Since("3.1.0")
 
 Review comment:
   Why removing this public method?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391397045
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/FValueSelector.scala
 ##########
 @@ -344,57 +156,31 @@ class FValueSelectorModel private[ml](
 
   @Since("3.1.0")
   override def toString: String = {
-    s"FValueSelectorModel: uid=$uid, numSelectedFeatures=${selectedFeatures.length}"
-  }
-
-  private[spark] def compressSparse(
-      indices: Array[Int],
-      values: Array[Double]): (Array[Int], Array[Double]) = {
-    val newValues = new ArrayBuilder.ofDouble
-    val newIndices = new ArrayBuilder.ofInt
-    var i = 0
-    var j = 0
-    var indicesIdx = 0
-    var filterIndicesIdx = 0
-    while (i < indices.length && j < selectedFeatures.length) {
-      indicesIdx = indices(i)
-      filterIndicesIdx = selectedFeatures(j)
-      if (indicesIdx == filterIndicesIdx) {
-        newIndices += j
-        newValues += values(i)
-        j += 1
-        i += 1
-      } else {
-        if (indicesIdx > filterIndicesIdx) {
-          j += 1
-        } else {
-          i += 1
-        }
-      }
-    }
-    // TODO: Sparse representation might be ineffective if (newSize ~= newValues.size)
-    (newIndices.result(), newValues.result())
+    s"FValueModel: uid=$uid, numSelectedFeatures=${selectedFeatures.length}"
   }
 }
 
 @Since("3.1.0")
 object FValueSelectorModel extends MLReadable[FValueSelectorModel] {
 
   @Since("3.1.0")
-  override def read: MLReader[FValueSelectorModel] =
-    new FValueSelectorModelReader
+  override def read: MLReader[FValueSelectorModel] = new FValueSelectorModelReader
 
   @Since("3.1.0")
   override def load(path: String): FValueSelectorModel = super.load(path)
 
   private[FValueSelectorModel] class FValueSelectorModelWriter(
-      instance: FValueSelectorModel) extends MLWriter {
+      instance:
+      FValueSelectorModel) extends MLWriter {
 
 Review comment:
   lint

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599350818
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-598515062
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391957647
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/stat/SelectionTest.scala
 ##########
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import org.apache.commons.math3.distribution.FDistribution
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.util.SchemaUtils
+import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
+import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}
+import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.DoubleType
+
+
+@Since("3.1.0")
+object SelectionTest {
 
 Review comment:
   I will delete this class

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391398559
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/Selector.scala
 ##########
 @@ -0,0 +1,379 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.collection.mutable.ArrayBuilder
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml._
+import org.apache.spark.ml.attribute.{AttributeGroup, _}
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.stat.SelectionTestResult
+import org.apache.spark.ml.util._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+
+/**
+ * Params for [[Selector]] and [[SelectorModel]].
+ */
+private[feature] trait SelectorParams extends Params
+  with HasFeaturesCol with HasOutputCol with HasLabelCol {
+
+  /**
+   * Number of features that selector will select, ordered by ascending p-value. If the
+   * number of features is less than numTopFeatures, then this will select all features.
+   * Only applicable when selectorType = "numTopFeatures".
+   * The default value of numTopFeatures is 50.
+   *
+   * @group param
+   */
+  @Since("3.1.0")
+  final val numTopFeatures = new IntParam(this, "numTopFeatures",
+    "Number of features that selector will select, ordered by ascending p-value. If the" +
+      " number of features is < numTopFeatures, then this will select all features.",
+    ParamValidators.gtEq(1))
+  setDefault(numTopFeatures -> 50)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getNumTopFeatures: Int = $(numTopFeatures)
+
+  /**
+   * Percentile of features that selector will select, ordered by ascending p-value.
+   * Only applicable when selectorType = "percentile".
+   * Default value is 0.1.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val percentile = new DoubleParam(this, "percentile",
+    "Percentile of features that selector will select, ordered by ascending p-value.",
+    ParamValidators.inRange(0, 1))
+  setDefault(percentile -> 0.1)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getPercentile: Double = $(percentile)
+
+  /**
+   * The highest p-value for features to be kept.
+   * Only applicable when selectorType = "fpr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fpr = new DoubleParam(this, "fpr", "The higest p-value for features to be kept.",
+    ParamValidators.inRange(0, 1))
+  setDefault(fpr -> 0.05)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getFpr: Double = $(fpr)
+
+  /**
+   * The upper bound of the expected false discovery rate.
+   * Only applicable when selectorType = "fdr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fdr = new DoubleParam(this, "fdr",
+    "The upper bound of the expected false discovery rate.", ParamValidators.inRange(0, 1))
+  setDefault(fdr -> 0.05)
+
+  /** @group getParam */
+  def getFdr: Double = $(fdr)
+
+  /**
+   * The upper bound of the expected family-wise error rate.
+   * Only applicable when selectorType = "fwe".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fwe = new DoubleParam(this, "fwe",
+    "The upper bound of the expected family-wise error rate.", ParamValidators.inRange(0, 1))
+  setDefault(fwe -> 0.05)
+
+  /** @group getParam */
+  def getFwe: Double = $(fwe)
+
+  /**
+   * The selector type.
+   * Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe"
+   * @group param
+   */
+  @Since("3.1.0")
+  final val selectorType = new Param[String](this, "selectorType",
+    "The selector type. Supported options: numTopFeatures, percentile, fpr, fdr, fwe",
+    ParamValidators.inArray(Array("numTopFeatures", "percentile", "fpr", "fdr",
+      "fwe")))
+  setDefault(selectorType -> "numTopFeatures")
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getSelectorType: String = $(selectorType)
+}
+
+/**
+ * Super class for all the feature selectors. The following selectors are supported:
+ * 1. Chi-Square Selector
+ * This feature selector is for categorical features and categorical labels.
+ * 2. ANOVA F-value Classification Selector
+ * This feature selector is for continuous features and categorical labels.
+ * 3. Regression F-value Selector
+ * This feature selector is for continuous features and continuous labels.
+ * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
+ * `fdr`, `fwe`.
+ *  - `numTopFeatures` chooses a fixed number of top features according to a hypothesis.
+ *  - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
+ *  - `fpr` chooses all features whose p-value are below a threshold, thus controlling the false
+ *    positive rate of selection.
+ *  - `fdr` uses the [Benjamini-Hochberg procedure]
+ *    (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
+ *    to choose all features whose false discovery rate is below a threshold.
+ *  - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+ *    1/numFeatures, thus controlling the family-wise error rate of selection.
+ * By default, the selection method is `numTopFeatures`, with the default number of top features
+ * set to 50.
+ */
+@Since("3.1.0")
+private[ml] abstract class Selector[T <: SelectorModel[T]]
+  extends Estimator[T] with SelectorParams with DefaultParamsWritable {
+  self: Estimator[T] =>
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
 
 Review comment:
   We may do not need to override these methods in subclasses?
   what about just removing the since annotation like we did in shared params?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599471695
 
 
   **[Test build #119842 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119842/testReport)** for PR 27882 at commit [`eb841a5`](https://github.com/apache/spark/commit/eb841a5148a56a4e814148898408704f40dadcba).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599481165
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119851/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599289509
 
 
   **[Test build #119823 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119823/testReport)** for PR 27882 at commit [`b219f0a`](https://github.com/apache/spark/commit/b219f0af6713c77b162db50b72f0b42d2c420818).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-597975130
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119685/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-598515067
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119731/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-598515067
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119731/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-602109383
 
 
   Build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-598477279
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24458/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
huaxingao commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599379096
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-598479027
 
 
   **[Test build #119731 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119731/testReport)** for PR 27882 at commit [`1965424`](https://github.com/apache/spark/commit/1965424164d2c2ff7d4287ef3c88f2174f8a1dc2).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391957450
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/Selector.scala
 ##########
 @@ -0,0 +1,379 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.collection.mutable.ArrayBuilder
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml._
+import org.apache.spark.ml.attribute.{AttributeGroup, _}
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.stat.SelectionTestResult
+import org.apache.spark.ml.util._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+
+/**
+ * Params for [[Selector]] and [[SelectorModel]].
+ */
+private[feature] trait SelectorParams extends Params
+  with HasFeaturesCol with HasOutputCol with HasLabelCol {
+
+  /**
+   * Number of features that selector will select, ordered by ascending p-value. If the
+   * number of features is less than numTopFeatures, then this will select all features.
+   * Only applicable when selectorType = "numTopFeatures".
+   * The default value of numTopFeatures is 50.
+   *
+   * @group param
+   */
+  @Since("3.1.0")
+  final val numTopFeatures = new IntParam(this, "numTopFeatures",
+    "Number of features that selector will select, ordered by ascending p-value. If the" +
+      " number of features is < numTopFeatures, then this will select all features.",
+    ParamValidators.gtEq(1))
+  setDefault(numTopFeatures -> 50)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getNumTopFeatures: Int = $(numTopFeatures)
+
+  /**
+   * Percentile of features that selector will select, ordered by ascending p-value.
+   * Only applicable when selectorType = "percentile".
+   * Default value is 0.1.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val percentile = new DoubleParam(this, "percentile",
+    "Percentile of features that selector will select, ordered by ascending p-value.",
+    ParamValidators.inRange(0, 1))
+  setDefault(percentile -> 0.1)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getPercentile: Double = $(percentile)
+
+  /**
+   * The highest p-value for features to be kept.
+   * Only applicable when selectorType = "fpr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fpr = new DoubleParam(this, "fpr", "The higest p-value for features to be kept.",
+    ParamValidators.inRange(0, 1))
+  setDefault(fpr -> 0.05)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getFpr: Double = $(fpr)
+
+  /**
+   * The upper bound of the expected false discovery rate.
+   * Only applicable when selectorType = "fdr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fdr = new DoubleParam(this, "fdr",
+    "The upper bound of the expected false discovery rate.", ParamValidators.inRange(0, 1))
+  setDefault(fdr -> 0.05)
+
+  /** @group getParam */
+  def getFdr: Double = $(fdr)
+
+  /**
+   * The upper bound of the expected family-wise error rate.
+   * Only applicable when selectorType = "fwe".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fwe = new DoubleParam(this, "fwe",
+    "The upper bound of the expected family-wise error rate.", ParamValidators.inRange(0, 1))
+  setDefault(fwe -> 0.05)
+
+  /** @group getParam */
+  def getFwe: Double = $(fwe)
+
+  /**
+   * The selector type.
+   * Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe"
+   * @group param
+   */
+  @Since("3.1.0")
+  final val selectorType = new Param[String](this, "selectorType",
+    "The selector type. Supported options: numTopFeatures, percentile, fpr, fdr, fwe",
+    ParamValidators.inArray(Array("numTopFeatures", "percentile", "fpr", "fdr",
+      "fwe")))
+  setDefault(selectorType -> "numTopFeatures")
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getSelectorType: String = $(selectorType)
+}
+
+/**
+ * Super class for all the feature selectors. The following selectors are supported:
+ * 1. Chi-Square Selector
+ * This feature selector is for categorical features and categorical labels.
+ * 2. ANOVA F-value Classification Selector
+ * This feature selector is for continuous features and categorical labels.
+ * 3. Regression F-value Selector
+ * This feature selector is for continuous features and continuous labels.
+ * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
+ * `fdr`, `fwe`.
+ *  - `numTopFeatures` chooses a fixed number of top features according to a hypothesis.
+ *  - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
+ *  - `fpr` chooses all features whose p-value are below a threshold, thus controlling the false
+ *    positive rate of selection.
+ *  - `fdr` uses the [Benjamini-Hochberg procedure]
+ *    (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
+ *    to choose all features whose false discovery rate is below a threshold.
+ *  - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+ *    1/numFeatures, thus controlling the family-wise error rate of selection.
+ * By default, the selection method is `numTopFeatures`, with the default number of top features
+ * set to 50.
+ */
+@Since("3.1.0")
+private[ml] abstract class Selector[T <: SelectorModel[T]]
+  extends Estimator[T] with SelectorParams with DefaultParamsWritable {
+  self: Estimator[T] =>
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
 
 Review comment:
   I think I got an error if I don't override. I will double check.
   Do you mean to remove since annotation for the params, and getXXX, but keep the annotation for setXXX? I think that's how the shared params handle since annotation. 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599472737
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599301772
 
 
   **[Test build #119821 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119821/testReport)** for PR 27882 at commit [`b4d30f9`](https://github.com/apache/spark/commit/b4d30f97a0d4bd40bc689f7d1fd961b8ce66d123).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-597975127
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599302162
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599380129
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599379631
 
 
   **[Test build #119842 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119842/testReport)** for PR 27882 at commit [`eb841a5`](https://github.com/apache/spark/commit/eb841a5148a56a4e814148898408704f40dadcba).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599278543
 
 
   **[Test build #119821 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119821/testReport)** for PR 27882 at commit [`b4d30f9`](https://github.com/apache/spark/commit/b4d30f97a0d4bd40bc689f7d1fd961b8ce66d123).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599326600
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119823/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391398684
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/Selector.scala
 ##########
 @@ -0,0 +1,379 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.collection.mutable.ArrayBuilder
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml._
+import org.apache.spark.ml.attribute.{AttributeGroup, _}
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.stat.SelectionTestResult
+import org.apache.spark.ml.util._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+
+/**
+ * Params for [[Selector]] and [[SelectorModel]].
+ */
+private[feature] trait SelectorParams extends Params
+  with HasFeaturesCol with HasOutputCol with HasLabelCol {
+
+  /**
+   * Number of features that selector will select, ordered by ascending p-value. If the
+   * number of features is less than numTopFeatures, then this will select all features.
+   * Only applicable when selectorType = "numTopFeatures".
+   * The default value of numTopFeatures is 50.
+   *
+   * @group param
+   */
+  @Since("3.1.0")
+  final val numTopFeatures = new IntParam(this, "numTopFeatures",
+    "Number of features that selector will select, ordered by ascending p-value. If the" +
+      " number of features is < numTopFeatures, then this will select all features.",
+    ParamValidators.gtEq(1))
+  setDefault(numTopFeatures -> 50)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getNumTopFeatures: Int = $(numTopFeatures)
+
+  /**
+   * Percentile of features that selector will select, ordered by ascending p-value.
+   * Only applicable when selectorType = "percentile".
+   * Default value is 0.1.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val percentile = new DoubleParam(this, "percentile",
+    "Percentile of features that selector will select, ordered by ascending p-value.",
+    ParamValidators.inRange(0, 1))
+  setDefault(percentile -> 0.1)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getPercentile: Double = $(percentile)
+
+  /**
+   * The highest p-value for features to be kept.
+   * Only applicable when selectorType = "fpr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fpr = new DoubleParam(this, "fpr", "The higest p-value for features to be kept.",
+    ParamValidators.inRange(0, 1))
+  setDefault(fpr -> 0.05)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getFpr: Double = $(fpr)
+
+  /**
+   * The upper bound of the expected false discovery rate.
+   * Only applicable when selectorType = "fdr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fdr = new DoubleParam(this, "fdr",
+    "The upper bound of the expected false discovery rate.", ParamValidators.inRange(0, 1))
+  setDefault(fdr -> 0.05)
+
+  /** @group getParam */
+  def getFdr: Double = $(fdr)
+
+  /**
+   * The upper bound of the expected family-wise error rate.
+   * Only applicable when selectorType = "fwe".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fwe = new DoubleParam(this, "fwe",
+    "The upper bound of the expected family-wise error rate.", ParamValidators.inRange(0, 1))
+  setDefault(fwe -> 0.05)
+
+  /** @group getParam */
+  def getFwe: Double = $(fwe)
+
+  /**
+   * The selector type.
+   * Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe"
+   * @group param
+   */
+  @Since("3.1.0")
+  final val selectorType = new Param[String](this, "selectorType",
+    "The selector type. Supported options: numTopFeatures, percentile, fpr, fdr, fwe",
+    ParamValidators.inArray(Array("numTopFeatures", "percentile", "fpr", "fdr",
+      "fwe")))
+  setDefault(selectorType -> "numTopFeatures")
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getSelectorType: String = $(selectorType)
+}
+
+/**
+ * Super class for all the feature selectors. The following selectors are supported:
+ * 1. Chi-Square Selector
+ * This feature selector is for categorical features and categorical labels.
+ * 2. ANOVA F-value Classification Selector
+ * This feature selector is for continuous features and categorical labels.
+ * 3. Regression F-value Selector
+ * This feature selector is for continuous features and continuous labels.
+ * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
+ * `fdr`, `fwe`.
+ *  - `numTopFeatures` chooses a fixed number of top features according to a hypothesis.
+ *  - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
+ *  - `fpr` chooses all features whose p-value are below a threshold, thus controlling the false
+ *    positive rate of selection.
+ *  - `fdr` uses the [Benjamini-Hochberg procedure]
+ *    (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
+ *    to choose all features whose false discovery rate is below a threshold.
+ *  - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+ *    1/numFeatures, thus controlling the family-wise error rate of selection.
+ * By default, the selection method is `numTopFeatures`, with the default number of top features
+ * set to 50.
+ */
+@Since("3.1.0")
+private[ml] abstract class Selector[T <: SelectorModel[T]]
+  extends Estimator[T] with SelectorParams with DefaultParamsWritable {
+  self: Estimator[T] =>
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setPercentile(value: Double): this.type = set(percentile, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setFpr(value: Double): this.type = set(fpr, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setFdr(value: Double): this.type = set(fdr, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setFwe(value: Double): this.type = set(fwe, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setSelectorType(value: String): this.type = set(selectorType, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setLabelCol(value: String): this.type = set(labelCol, value)
+
+  /**
+   * get the SelectionTestResult for every feature against the label
+   */
+  @Since("3.1.0")
+  protected[this] def getSelectionTestResult(dataset: Dataset[_]): Array[SelectionTestResult]
+
+  /**
+   * Create a new instance of concrete SelectorModel.
+   * @param indices The indices of the selected features
+   * @param pValues The pValues of the selected features
+   * @param statistics The statistics of the selected features
+   * @return A new SelectorModel instance
+   */
+  @Since("3.1.0")
+  protected[this] def createSelectorModel(
 
 Review comment:
   ditto

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599302162
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391397568
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/Selector.scala
 ##########
 @@ -0,0 +1,379 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.collection.mutable.ArrayBuilder
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml._
+import org.apache.spark.ml.attribute.{AttributeGroup, _}
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.stat.SelectionTestResult
+import org.apache.spark.ml.util._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+
+/**
+ * Params for [[Selector]] and [[SelectorModel]].
+ */
+private[feature] trait SelectorParams extends Params
+  with HasFeaturesCol with HasOutputCol with HasLabelCol {
+
+  /**
+   * Number of features that selector will select, ordered by ascending p-value. If the
+   * number of features is less than numTopFeatures, then this will select all features.
+   * Only applicable when selectorType = "numTopFeatures".
+   * The default value of numTopFeatures is 50.
+   *
+   * @group param
+   */
+  @Since("3.1.0")
+  final val numTopFeatures = new IntParam(this, "numTopFeatures",
+    "Number of features that selector will select, ordered by ascending p-value. If the" +
+      " number of features is < numTopFeatures, then this will select all features.",
+    ParamValidators.gtEq(1))
+  setDefault(numTopFeatures -> 50)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getNumTopFeatures: Int = $(numTopFeatures)
+
+  /**
+   * Percentile of features that selector will select, ordered by ascending p-value.
+   * Only applicable when selectorType = "percentile".
+   * Default value is 0.1.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val percentile = new DoubleParam(this, "percentile",
+    "Percentile of features that selector will select, ordered by ascending p-value.",
+    ParamValidators.inRange(0, 1))
+  setDefault(percentile -> 0.1)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getPercentile: Double = $(percentile)
+
+  /**
+   * The highest p-value for features to be kept.
+   * Only applicable when selectorType = "fpr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fpr = new DoubleParam(this, "fpr", "The higest p-value for features to be kept.",
+    ParamValidators.inRange(0, 1))
+  setDefault(fpr -> 0.05)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getFpr: Double = $(fpr)
+
+  /**
+   * The upper bound of the expected false discovery rate.
+   * Only applicable when selectorType = "fdr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fdr = new DoubleParam(this, "fdr",
+    "The upper bound of the expected false discovery rate.", ParamValidators.inRange(0, 1))
+  setDefault(fdr -> 0.05)
+
+  /** @group getParam */
+  def getFdr: Double = $(fdr)
+
+  /**
+   * The upper bound of the expected family-wise error rate.
+   * Only applicable when selectorType = "fwe".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fwe = new DoubleParam(this, "fwe",
+    "The upper bound of the expected family-wise error rate.", ParamValidators.inRange(0, 1))
+  setDefault(fwe -> 0.05)
+
+  /** @group getParam */
+  def getFwe: Double = $(fwe)
+
+  /**
+   * The selector type.
+   * Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe"
+   * @group param
+   */
+  @Since("3.1.0")
+  final val selectorType = new Param[String](this, "selectorType",
+    "The selector type. Supported options: numTopFeatures, percentile, fpr, fdr, fwe",
+    ParamValidators.inArray(Array("numTopFeatures", "percentile", "fpr", "fdr",
+      "fwe")))
+  setDefault(selectorType -> "numTopFeatures")
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getSelectorType: String = $(selectorType)
+}
+
+/**
+ * Super class for all the feature selectors. The following selectors are supported:
+ * 1. Chi-Square Selector
+ * This feature selector is for categorical features and categorical labels.
+ * 2. ANOVA F-value Classification Selector
 
 Review comment:
   `ANOVA F-value` is yet not implemented now, so what about implementing it first and then make such abstraction to reduce common code?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599376456
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119834/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391395961
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
 ##########
 @@ -153,102 +48,75 @@ private[feature] trait ChiSqSelectorParams extends Params
  */
 @Since("1.6.0")
 final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: String)
-  extends Estimator[ChiSqSelectorModel] with ChiSqSelectorParams with DefaultParamsWritable {
+  extends Selector[ChiSqSelectorModel] {
 
   @Since("1.6.0")
   def this() = this(Identifiable.randomUID("chiSqSelector"))
 
   /** @group setParam */
   @Since("1.6.0")
-  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
+  override def setNumTopFeatures(value: Int): this.type = super.setNumTopFeatures(value)
 
   /** @group setParam */
   @Since("2.1.0")
-  def setPercentile(value: Double): this.type = set(percentile, value)
+  override def setPercentile(value: Double): this.type = super.setPercentile(value)
 
   /** @group setParam */
   @Since("2.1.0")
-  def setFpr(value: Double): this.type = set(fpr, value)
+  override def setFpr(value: Double): this.type = super.setFpr(value)
 
   /** @group setParam */
   @Since("2.2.0")
-  def setFdr(value: Double): this.type = set(fdr, value)
+  override def setFdr(value: Double): this.type = super.setFdr(value)
 
   /** @group setParam */
   @Since("2.2.0")
-  def setFwe(value: Double): this.type = set(fwe, value)
+  override def setFwe(value: Double): this.type = super.setFwe(value)
 
   /** @group setParam */
   @Since("2.1.0")
-  def setSelectorType(value: String): this.type = set(selectorType, value)
+  override def setSelectorType(value: String): this.type = super.setSelectorType(value)
 
   /** @group setParam */
   @Since("1.6.0")
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+  override def setFeaturesCol(value: String): this.type = super.setFeaturesCol(value)
 
   /** @group setParam */
   @Since("1.6.0")
-  def setOutputCol(value: String): this.type = set(outputCol, value)
+  override def setOutputCol(value: String): this.type = super.setOutputCol(value)
 
   /** @group setParam */
   @Since("1.6.0")
-  def setLabelCol(value: String): this.type = set(labelCol, value)
-
-  @Since("2.0.0")
-  override def fit(dataset: Dataset[_]): ChiSqSelectorModel = {
-    transformSchema(dataset.schema, logging = true)
-    dataset.select(col($(labelCol)).cast(DoubleType), col($(featuresCol))).rdd.map {
-      case Row(label: Double, features: Vector) =>
-        LabeledPoint(label, features)
-    }
+  override def setLabelCol(value: String): this.type = super.setLabelCol(value)
 
-    val testResult = ChiSquareTest.testChiSquare(dataset, getFeaturesCol, getLabelCol)
-      .zipWithIndex
-    val features = $(selectorType) match {
-      case "numTopFeatures" =>
-        testResult
-          .sortBy { case (res, _) => res.pValue }
-          .take(getNumTopFeatures)
-      case "percentile" =>
-        testResult
-          .sortBy { case (res, _) => res.pValue }
-          .take((testResult.length * getPercentile).toInt)
-      case "fpr" =>
-        testResult
-          .filter { case (res, _) => res.pValue < getFpr }
-      case "fdr" =>
-        // This uses the Benjamini-Hochberg procedure.
-        // https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure
-        val tempRes = testResult
-          .sortBy { case (res, _) => res.pValue }
-        val selected = tempRes
-          .zipWithIndex
-          .filter { case ((res, _), index) =>
-            res.pValue <= getFdr * (index + 1) / testResult.length }
-        if (selected.isEmpty) {
-          Array.empty[(SelectionTestResult, Int)]
-        } else {
-          val maxIndex = selected.map(_._2).max
-          tempRes.take(maxIndex + 1)
-        }
-      case "fwe" =>
-        testResult
-          .filter { case (res, _) => res.pValue < getFwe / testResult.length }
-      case errorType =>
-        throw new IllegalStateException(s"Unknown Selector Type: $errorType")
-    }
-    val indices = features.map { case (_, index) => index }
-    copyValues(new ChiSqSelectorModel(uid, indices.sorted)
-      .setParent(this))
+  /**
+   * get the SelectionTestResult for every feature against the label
+   */
+  @Since("3.1.0")
+  protected[this] override def getSelectionTestResult(dataset: Dataset[_]):
+  Array[SelectionTestResult] = {
+    SelectionTest.chiSquareTest(dataset, getFeaturesCol, getLabelCol)
   }
 
-  @Since("1.6.0")
-  override def transformSchema(schema: StructType): StructType = {
-    SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
-    SchemaUtils.checkNumericType(schema, $(labelCol))
-    SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
+  /**
+   * Create a new instance of concrete SelectorModel.
+   * @param indices The indices of the selected features
+   * @param pValues The pValues of the selected features
+   * @param statistics The chi square statistic of the selected features
+   * @return A new SelectorModel instance
+   */
+  @Since("3.1.0")
+  protected[this] def createSelectorModel(
 
 Review comment:
   ditto

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-598479321
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599376456
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119834/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-597975127
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
huaxingao commented on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-598488263
 
 
   > ANOVA F-value is yet not implemented now, so what about implementing it first and then make such abstraction to reduce common code?
   
   I submitted https://github.com/apache/spark/pull/27895 for ANOVASelector.  I will mark this PR as WIP for now. After ANOVASelector is done, I will add ANOVA in this PR too. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599326597
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-602111427
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120135/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391396309
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
 ##########
 @@ -266,72 +134,27 @@ object ChiSqSelector extends DefaultParamsReadable[ChiSqSelector] {
 @Since("1.6.0")
 final class ChiSqSelectorModel private[ml] (
     @Since("1.6.0") override val uid: String,
-    @Since("3.1.0") val selectedFeatures: Array[Int])
-  extends Model[ChiSqSelectorModel] with ChiSqSelectorParams with MLWritable {
+    @Since("3.1.0") override val selectedFeatures: Array[Int],
+    @Since("3.1.0") override val pValues: Array[Double],
+    @Since("3.1.0")override val statistic: Array[Double])
 
 Review comment:
   nit: space before `override`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-597936162
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24414/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599278719
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24551/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599350822
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24564/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-598479325
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24459/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-597936162
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24414/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599289678
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24553/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-597935758
 
 
   **[Test build #119685 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119685/testReport)** for PR 27882 at commit [`456e688`](https://github.com/apache/spark/commit/456e6883cb280f360bc0b556082761a97e8985ee).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599376452
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-602109175
 
 
   **[Test build #120135 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120135/testReport)** for PR 27882 at commit [`89860b5`](https://github.com/apache/spark/commit/89860b5ee99e632cdcd9dab705260feee3e39bb7).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599326597
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599392412
 
 
   **[Test build #119851 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119851/testReport)** for PR 27882 at commit [`4cee276`](https://github.com/apache/spark/commit/4cee27668e3025d455248da7541663c5d86d7fc7).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599390040
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24581/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-602111425
 
 
   Build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-602109387
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24849/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391398635
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/Selector.scala
 ##########
 @@ -0,0 +1,379 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.collection.mutable.ArrayBuilder
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml._
+import org.apache.spark.ml.attribute.{AttributeGroup, _}
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.stat.SelectionTestResult
+import org.apache.spark.ml.util._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+
+/**
+ * Params for [[Selector]] and [[SelectorModel]].
+ */
+private[feature] trait SelectorParams extends Params
+  with HasFeaturesCol with HasOutputCol with HasLabelCol {
+
+  /**
+   * Number of features that selector will select, ordered by ascending p-value. If the
+   * number of features is less than numTopFeatures, then this will select all features.
+   * Only applicable when selectorType = "numTopFeatures".
+   * The default value of numTopFeatures is 50.
+   *
+   * @group param
+   */
+  @Since("3.1.0")
+  final val numTopFeatures = new IntParam(this, "numTopFeatures",
+    "Number of features that selector will select, ordered by ascending p-value. If the" +
+      " number of features is < numTopFeatures, then this will select all features.",
+    ParamValidators.gtEq(1))
+  setDefault(numTopFeatures -> 50)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getNumTopFeatures: Int = $(numTopFeatures)
+
+  /**
+   * Percentile of features that selector will select, ordered by ascending p-value.
+   * Only applicable when selectorType = "percentile".
+   * Default value is 0.1.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val percentile = new DoubleParam(this, "percentile",
+    "Percentile of features that selector will select, ordered by ascending p-value.",
+    ParamValidators.inRange(0, 1))
+  setDefault(percentile -> 0.1)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getPercentile: Double = $(percentile)
+
+  /**
+   * The highest p-value for features to be kept.
+   * Only applicable when selectorType = "fpr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fpr = new DoubleParam(this, "fpr", "The higest p-value for features to be kept.",
+    ParamValidators.inRange(0, 1))
+  setDefault(fpr -> 0.05)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getFpr: Double = $(fpr)
+
+  /**
+   * The upper bound of the expected false discovery rate.
+   * Only applicable when selectorType = "fdr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fdr = new DoubleParam(this, "fdr",
+    "The upper bound of the expected false discovery rate.", ParamValidators.inRange(0, 1))
+  setDefault(fdr -> 0.05)
+
+  /** @group getParam */
+  def getFdr: Double = $(fdr)
+
+  /**
+   * The upper bound of the expected family-wise error rate.
+   * Only applicable when selectorType = "fwe".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fwe = new DoubleParam(this, "fwe",
+    "The upper bound of the expected family-wise error rate.", ParamValidators.inRange(0, 1))
+  setDefault(fwe -> 0.05)
+
+  /** @group getParam */
+  def getFwe: Double = $(fwe)
+
+  /**
+   * The selector type.
+   * Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe"
+   * @group param
+   */
+  @Since("3.1.0")
+  final val selectorType = new Param[String](this, "selectorType",
+    "The selector type. Supported options: numTopFeatures, percentile, fpr, fdr, fwe",
+    ParamValidators.inArray(Array("numTopFeatures", "percentile", "fpr", "fdr",
+      "fwe")))
+  setDefault(selectorType -> "numTopFeatures")
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getSelectorType: String = $(selectorType)
+}
+
+/**
+ * Super class for all the feature selectors. The following selectors are supported:
+ * 1. Chi-Square Selector
+ * This feature selector is for categorical features and categorical labels.
+ * 2. ANOVA F-value Classification Selector
+ * This feature selector is for continuous features and categorical labels.
+ * 3. Regression F-value Selector
+ * This feature selector is for continuous features and continuous labels.
+ * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
+ * `fdr`, `fwe`.
+ *  - `numTopFeatures` chooses a fixed number of top features according to a hypothesis.
+ *  - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
+ *  - `fpr` chooses all features whose p-value are below a threshold, thus controlling the false
+ *    positive rate of selection.
+ *  - `fdr` uses the [Benjamini-Hochberg procedure]
+ *    (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
+ *    to choose all features whose false discovery rate is below a threshold.
+ *  - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+ *    1/numFeatures, thus controlling the family-wise error rate of selection.
+ * By default, the selection method is `numTopFeatures`, with the default number of top features
+ * set to 50.
+ */
+@Since("3.1.0")
+private[ml] abstract class Selector[T <: SelectorModel[T]]
+  extends Estimator[T] with SelectorParams with DefaultParamsWritable {
+  self: Estimator[T] =>
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setPercentile(value: Double): this.type = set(percentile, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setFpr(value: Double): this.type = set(fpr, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setFdr(value: Double): this.type = set(fdr, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setFwe(value: Double): this.type = set(fwe, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setSelectorType(value: String): this.type = set(selectorType, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setLabelCol(value: String): this.type = set(labelCol, value)
+
+  /**
+   * get the SelectionTestResult for every feature against the label
+   */
+  @Since("3.1.0")
+  protected[this] def getSelectionTestResult(dataset: Dataset[_]): Array[SelectionTestResult]
 
 Review comment:
   ditto

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-602111425
 
 
   Build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599481156
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599326600
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119823/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-598479325
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24459/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao closed pull request #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
huaxingao closed pull request #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391398237
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/Selector.scala
 ##########
 @@ -0,0 +1,379 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.collection.mutable.ArrayBuilder
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml._
+import org.apache.spark.ml.attribute.{AttributeGroup, _}
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.stat.SelectionTestResult
+import org.apache.spark.ml.util._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+
+/**
+ * Params for [[Selector]] and [[SelectorModel]].
+ */
+private[feature] trait SelectorParams extends Params
+  with HasFeaturesCol with HasOutputCol with HasLabelCol {
+
+  /**
+   * Number of features that selector will select, ordered by ascending p-value. If the
+   * number of features is less than numTopFeatures, then this will select all features.
+   * Only applicable when selectorType = "numTopFeatures".
+   * The default value of numTopFeatures is 50.
+   *
+   * @group param
+   */
+  @Since("3.1.0")
+  final val numTopFeatures = new IntParam(this, "numTopFeatures",
+    "Number of features that selector will select, ordered by ascending p-value. If the" +
+      " number of features is < numTopFeatures, then this will select all features.",
+    ParamValidators.gtEq(1))
+  setDefault(numTopFeatures -> 50)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getNumTopFeatures: Int = $(numTopFeatures)
+
+  /**
+   * Percentile of features that selector will select, ordered by ascending p-value.
+   * Only applicable when selectorType = "percentile".
+   * Default value is 0.1.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val percentile = new DoubleParam(this, "percentile",
+    "Percentile of features that selector will select, ordered by ascending p-value.",
+    ParamValidators.inRange(0, 1))
+  setDefault(percentile -> 0.1)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getPercentile: Double = $(percentile)
+
+  /**
+   * The highest p-value for features to be kept.
+   * Only applicable when selectorType = "fpr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fpr = new DoubleParam(this, "fpr", "The higest p-value for features to be kept.",
+    ParamValidators.inRange(0, 1))
+  setDefault(fpr -> 0.05)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getFpr: Double = $(fpr)
+
+  /**
+   * The upper bound of the expected false discovery rate.
+   * Only applicable when selectorType = "fdr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fdr = new DoubleParam(this, "fdr",
+    "The upper bound of the expected false discovery rate.", ParamValidators.inRange(0, 1))
+  setDefault(fdr -> 0.05)
+
+  /** @group getParam */
+  def getFdr: Double = $(fdr)
+
+  /**
+   * The upper bound of the expected family-wise error rate.
+   * Only applicable when selectorType = "fwe".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fwe = new DoubleParam(this, "fwe",
+    "The upper bound of the expected family-wise error rate.", ParamValidators.inRange(0, 1))
+  setDefault(fwe -> 0.05)
+
+  /** @group getParam */
+  def getFwe: Double = $(fwe)
+
+  /**
+   * The selector type.
+   * Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe"
+   * @group param
+   */
+  @Since("3.1.0")
+  final val selectorType = new Param[String](this, "selectorType",
+    "The selector type. Supported options: numTopFeatures, percentile, fpr, fdr, fwe",
+    ParamValidators.inArray(Array("numTopFeatures", "percentile", "fpr", "fdr",
+      "fwe")))
+  setDefault(selectorType -> "numTopFeatures")
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getSelectorType: String = $(selectorType)
+}
+
+/**
+ * Super class for all the feature selectors. The following selectors are supported:
+ * 1. Chi-Square Selector
+ * This feature selector is for categorical features and categorical labels.
+ * 2. ANOVA F-value Classification Selector
+ * This feature selector is for continuous features and categorical labels.
+ * 3. Regression F-value Selector
+ * This feature selector is for continuous features and continuous labels.
+ * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
+ * `fdr`, `fwe`.
+ *  - `numTopFeatures` chooses a fixed number of top features according to a hypothesis.
+ *  - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
+ *  - `fpr` chooses all features whose p-value are below a threshold, thus controlling the false
+ *    positive rate of selection.
+ *  - `fdr` uses the [Benjamini-Hochberg procedure]
+ *    (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
+ *    to choose all features whose false discovery rate is below a threshold.
+ *  - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+ *    1/numFeatures, thus controlling the family-wise error rate of selection.
+ * By default, the selection method is `numTopFeatures`, with the default number of top features
+ * set to 50.
+ */
+@Since("3.1.0")
+private[ml] abstract class Selector[T <: SelectorModel[T]]
+  extends Estimator[T] with SelectorParams with DefaultParamsWritable {
+  self: Estimator[T] =>
 
 Review comment:
   `Selector` already extends `Estimator`, so do we really need this?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-598477273
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391957054
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
 ##########
 @@ -153,102 +48,75 @@ private[feature] trait ChiSqSelectorParams extends Params
  */
 @Since("1.6.0")
 final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: String)
-  extends Estimator[ChiSqSelectorModel] with ChiSqSelectorParams with DefaultParamsWritable {
+  extends Selector[ChiSqSelectorModel] {
 
   @Since("1.6.0")
   def this() = this(Identifiable.randomUID("chiSqSelector"))
 
   /** @group setParam */
   @Since("1.6.0")
-  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
+  override def setNumTopFeatures(value: Int): this.type = super.setNumTopFeatures(value)
 
   /** @group setParam */
   @Since("2.1.0")
-  def setPercentile(value: Double): this.type = set(percentile, value)
+  override def setPercentile(value: Double): this.type = super.setPercentile(value)
 
   /** @group setParam */
   @Since("2.1.0")
-  def setFpr(value: Double): this.type = set(fpr, value)
+  override def setFpr(value: Double): this.type = super.setFpr(value)
 
   /** @group setParam */
   @Since("2.2.0")
-  def setFdr(value: Double): this.type = set(fdr, value)
+  override def setFdr(value: Double): this.type = super.setFdr(value)
 
   /** @group setParam */
   @Since("2.2.0")
-  def setFwe(value: Double): this.type = set(fwe, value)
+  override def setFwe(value: Double): this.type = super.setFwe(value)
 
   /** @group setParam */
   @Since("2.1.0")
-  def setSelectorType(value: String): this.type = set(selectorType, value)
+  override def setSelectorType(value: String): this.type = super.setSelectorType(value)
 
   /** @group setParam */
   @Since("1.6.0")
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+  override def setFeaturesCol(value: String): this.type = super.setFeaturesCol(value)
 
   /** @group setParam */
   @Since("1.6.0")
-  def setOutputCol(value: String): this.type = set(outputCol, value)
+  override def setOutputCol(value: String): this.type = super.setOutputCol(value)
 
   /** @group setParam */
   @Since("1.6.0")
-  def setLabelCol(value: String): this.type = set(labelCol, value)
-
-  @Since("2.0.0")
-  override def fit(dataset: Dataset[_]): ChiSqSelectorModel = {
-    transformSchema(dataset.schema, logging = true)
-    dataset.select(col($(labelCol)).cast(DoubleType), col($(featuresCol))).rdd.map {
-      case Row(label: Double, features: Vector) =>
-        LabeledPoint(label, features)
-    }
+  override def setLabelCol(value: String): this.type = super.setLabelCol(value)
 
-    val testResult = ChiSquareTest.testChiSquare(dataset, getFeaturesCol, getLabelCol)
-      .zipWithIndex
-    val features = $(selectorType) match {
-      case "numTopFeatures" =>
-        testResult
-          .sortBy { case (res, _) => res.pValue }
-          .take(getNumTopFeatures)
-      case "percentile" =>
-        testResult
-          .sortBy { case (res, _) => res.pValue }
-          .take((testResult.length * getPercentile).toInt)
-      case "fpr" =>
-        testResult
-          .filter { case (res, _) => res.pValue < getFpr }
-      case "fdr" =>
-        // This uses the Benjamini-Hochberg procedure.
-        // https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure
-        val tempRes = testResult
-          .sortBy { case (res, _) => res.pValue }
-        val selected = tempRes
-          .zipWithIndex
-          .filter { case ((res, _), index) =>
-            res.pValue <= getFdr * (index + 1) / testResult.length }
-        if (selected.isEmpty) {
-          Array.empty[(SelectionTestResult, Int)]
-        } else {
-          val maxIndex = selected.map(_._2).max
-          tempRes.take(maxIndex + 1)
-        }
-      case "fwe" =>
-        testResult
-          .filter { case (res, _) => res.pValue < getFwe / testResult.length }
-      case errorType =>
-        throw new IllegalStateException(s"Unknown Selector Type: $errorType")
-    }
-    val indices = features.map { case (_, index) => index }
-    copyValues(new ChiSqSelectorModel(uid, indices.sorted)
-      .setParent(this))
+  /**
+   * get the SelectionTestResult for every feature against the label
+   */
+  @Since("3.1.0")
+  protected[this] override def getSelectionTestResult(dataset: Dataset[_]):
+  Array[SelectionTestResult] = {
+    SelectionTest.chiSquareTest(dataset, getFeaturesCol, getLabelCol)
   }
 
-  @Since("1.6.0")
-  override def transformSchema(schema: StructType): StructType = {
-    SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
-    SchemaUtils.checkNumericType(schema, $(labelCol))
-    SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
+  /**
+   * Create a new instance of concrete SelectorModel.
+   * @param indices The indices of the selected features
+   * @param pValues The pValues of the selected features
+   * @param statistics The chi square statistic of the selected features
+   * @return A new SelectorModel instance
+   */
+  @Since("3.1.0")
+  protected[this] def createSelectorModel(
 
 Review comment:
   I am not sure if I can delete this method. I have problem doing the following in the end of ```Selector.fit```
   ```
   copyValues(new SelectorModel(uid, indices, pValues, statistic).setParent(this))
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599289678
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24553/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391400190
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/stat/SelectionTest.scala
 ##########
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import org.apache.commons.math3.distribution.FDistribution
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.util.SchemaUtils
+import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
+import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}
+import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.DoubleType
+
+
+@Since("3.1.0")
+object SelectionTest {
 
 Review comment:
   I am not sure whether we need this, since there already are `FValueTest` and `ChiSquareTest`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-602109387
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24849/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-602109175
 
 
   **[Test build #120135 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120135/testReport)** for PR 27882 at commit [`89860b5`](https://github.com/apache/spark/commit/89860b5ee99e632cdcd9dab705260feee3e39bb7).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-597936158
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-602111416
 
 
   **[Test build #120135 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120135/testReport)** for PR 27882 at commit [`89860b5`](https://github.com/apache/spark/commit/89860b5ee99e632cdcd9dab705260feee3e39bb7).
    * This patch **fails MiMa tests**.
    * This patch **does not merge cleanly**.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-598479027
 
 
   **[Test build #119731 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119731/testReport)** for PR 27882 at commit [`1965424`](https://github.com/apache/spark/commit/1965424164d2c2ff7d4287ef3c88f2174f8a1dc2).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599480216
 
 
   **[Test build #119851 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119851/testReport)** for PR 27882 at commit [`4cee276`](https://github.com/apache/spark/commit/4cee27668e3025d455248da7541663c5d86d7fc7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599390040
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24581/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599350822
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24564/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599289673
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-597936158
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599472747
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119842/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599472747
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119842/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599350818
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599289673
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391396858
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/FValueSelector.scala
 ##########
 @@ -154,106 +46,75 @@ private[feature] trait FValueSelectorParams extends Params
  * set to 50.
  */
 @Since("3.1.0")
-final class FValueSelector @Since("3.1.0") (override val uid: String)
-  extends Estimator[FValueSelectorModel] with FValueSelectorParams
-    with DefaultParamsWritable {
+final class FValueSelector @Since("3.1.0") (@Since("3.1.0") override val uid: String) extends
+  Selector[FValueSelectorModel] {
 
   @Since("3.1.0")
   def this() = this(Identifiable.randomUID("FValueSelector"))
 
   /** @group setParam */
   @Since("3.1.0")
-  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
+  override def setNumTopFeatures(value: Int): this.type = super.setNumTopFeatures(value)
 
   /** @group setParam */
   @Since("3.1.0")
-  def setPercentile(value: Double): this.type = set(percentile, value)
+  override def setPercentile(value: Double): this.type = super.setPercentile(value)
 
   /** @group setParam */
   @Since("3.1.0")
-  def setFpr(value: Double): this.type = set(fpr, value)
+  override def setFpr(value: Double): this.type = super.setFpr(value)
 
   /** @group setParam */
   @Since("3.1.0")
-  def setFdr(value: Double): this.type = set(fdr, value)
+  override def setFdr(value: Double): this.type = super.setFdr(value)
 
   /** @group setParam */
   @Since("3.1.0")
-  def setFwe(value: Double): this.type = set(fwe, value)
+  override def setFwe(value: Double): this.type = super.setFwe(value)
 
   /** @group setParam */
   @Since("3.1.0")
-  def setSelectorType(value: String): this.type = set(selectorType, value)
+  override def setSelectorType(value: String): this.type = super.setSelectorType(value)
 
   /** @group setParam */
   @Since("3.1.0")
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+  override def setFeaturesCol(value: String): this.type = super.setFeaturesCol(value)
 
   /** @group setParam */
   @Since("3.1.0")
-  def setOutputCol(value: String): this.type = set(outputCol, value)
+  override def setOutputCol(value: String): this.type = super.setOutputCol(value)
 
   /** @group setParam */
   @Since("3.1.0")
-  def setLabelCol(value: String): this.type = set(labelCol, value)
+  override def setLabelCol(value: String): this.type = super.setLabelCol(value)
 
+  /**
+   * get the SelectionTestResult for every feature against the label
+   */
   @Since("3.1.0")
-  override def fit(dataset: Dataset[_]): FValueSelectorModel = {
-    transformSchema(dataset.schema, logging = true)
-    dataset.select(col($(labelCol)).cast(DoubleType), col($(featuresCol))).rdd.map {
-      case Row(label: Double, features: Vector) =>
-        LabeledPoint(label, features)
-    }
-
-    val testResult = FValueTest.testRegression(dataset, getFeaturesCol, getLabelCol)
-      .zipWithIndex
-    val features = $(selectorType) match {
-      case "numTopFeatures" =>
-        testResult
-          .sortBy { case (res, _) => res.pValue }
-          .take(getNumTopFeatures)
-      case "percentile" =>
-        testResult
-          .sortBy { case (res, _) => res.pValue }
-          .take((testResult.length * getPercentile).toInt)
-      case "fpr" =>
-        testResult
-          .filter { case (res, _) => res.pValue < getFpr }
-      case "fdr" =>
-        // This uses the Benjamini-Hochberg procedure.
-        // https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure
-        val tempRes = testResult
-          .sortBy { case (res, _) => res.pValue }
-        val selected = tempRes
-          .zipWithIndex
-          .filter { case ((res, _), index) =>
-            res.pValue <= getFdr * (index + 1) / testResult.length }
-        if (selected.isEmpty) {
-          Array.empty[(SelectionTestResult, Int)]
-        } else {
-          val maxIndex = selected.map(_._2).max
-          tempRes.take(maxIndex + 1)
-        }
-      case "fwe" =>
-        testResult
-          .filter { case (res, _) => res.pValue < getFwe / testResult.length }
-      case errorType =>
-        throw new IllegalStateException(s"Unknown Selector Type: $errorType")
-    }
-    val indices = features.map { case (_, index) => index }
-    copyValues(new FValueSelectorModel(uid, indices.sorted)
-      .setParent(this))
+  protected[this] override def getSelectionTestResult(dataset: Dataset[_]):
+  Array[SelectionTestResult] = {
+    SelectionTest.fValueTest(dataset, getFeaturesCol, getLabelCol)
   }
 
+  /**
+   * Create a new instance of concrete SelectorModel.
+   * @param indices The indices of the selected features
+   * @param pValues The pValues of the selected features
+   * @param statistics The f value of the selected features
+   * @return A new SelectorModel instance
+   */
   @Since("3.1.0")
-  override def transformSchema(schema: StructType): StructType = {
-    SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
-    SchemaUtils.checkNumericType(schema, $(labelCol))
-    SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
+  protected[this] def createSelectorModel(
 
 Review comment:
   ditto

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-597935758
 
 
   **[Test build #119685 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119685/testReport)** for PR 27882 at commit [`456e688`](https://github.com/apache/spark/commit/456e6883cb280f360bc0b556082761a97e8985ee).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-598479321
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-598477279
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24458/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599481165
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119851/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-602109383
 
 
   Build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599390032
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599379631
 
 
   **[Test build #119842 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119842/testReport)** for PR 27882 at commit [`eb841a5`](https://github.com/apache/spark/commit/eb841a5148a56a4e814148898408704f40dadcba).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391395690
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
 ##########
 @@ -153,102 +48,75 @@ private[feature] trait ChiSqSelectorParams extends Params
  */
 @Since("1.6.0")
 final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: String)
-  extends Estimator[ChiSqSelectorModel] with ChiSqSelectorParams with DefaultParamsWritable {
+  extends Selector[ChiSqSelectorModel] {
 
   @Since("1.6.0")
   def this() = this(Identifiable.randomUID("chiSqSelector"))
 
   /** @group setParam */
   @Since("1.6.0")
-  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
+  override def setNumTopFeatures(value: Int): this.type = super.setNumTopFeatures(value)
 
   /** @group setParam */
   @Since("2.1.0")
-  def setPercentile(value: Double): this.type = set(percentile, value)
+  override def setPercentile(value: Double): this.type = super.setPercentile(value)
 
   /** @group setParam */
   @Since("2.1.0")
-  def setFpr(value: Double): this.type = set(fpr, value)
+  override def setFpr(value: Double): this.type = super.setFpr(value)
 
   /** @group setParam */
   @Since("2.2.0")
-  def setFdr(value: Double): this.type = set(fdr, value)
+  override def setFdr(value: Double): this.type = super.setFdr(value)
 
   /** @group setParam */
   @Since("2.2.0")
-  def setFwe(value: Double): this.type = set(fwe, value)
+  override def setFwe(value: Double): this.type = super.setFwe(value)
 
   /** @group setParam */
   @Since("2.1.0")
-  def setSelectorType(value: String): this.type = set(selectorType, value)
+  override def setSelectorType(value: String): this.type = super.setSelectorType(value)
 
   /** @group setParam */
   @Since("1.6.0")
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+  override def setFeaturesCol(value: String): this.type = super.setFeaturesCol(value)
 
   /** @group setParam */
   @Since("1.6.0")
-  def setOutputCol(value: String): this.type = set(outputCol, value)
+  override def setOutputCol(value: String): this.type = super.setOutputCol(value)
 
   /** @group setParam */
   @Since("1.6.0")
-  def setLabelCol(value: String): this.type = set(labelCol, value)
-
-  @Since("2.0.0")
-  override def fit(dataset: Dataset[_]): ChiSqSelectorModel = {
-    transformSchema(dataset.schema, logging = true)
-    dataset.select(col($(labelCol)).cast(DoubleType), col($(featuresCol))).rdd.map {
-      case Row(label: Double, features: Vector) =>
-        LabeledPoint(label, features)
-    }
+  override def setLabelCol(value: String): this.type = super.setLabelCol(value)
 
-    val testResult = ChiSquareTest.testChiSquare(dataset, getFeaturesCol, getLabelCol)
-      .zipWithIndex
-    val features = $(selectorType) match {
-      case "numTopFeatures" =>
-        testResult
-          .sortBy { case (res, _) => res.pValue }
-          .take(getNumTopFeatures)
-      case "percentile" =>
-        testResult
-          .sortBy { case (res, _) => res.pValue }
-          .take((testResult.length * getPercentile).toInt)
-      case "fpr" =>
-        testResult
-          .filter { case (res, _) => res.pValue < getFpr }
-      case "fdr" =>
-        // This uses the Benjamini-Hochberg procedure.
-        // https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure
-        val tempRes = testResult
-          .sortBy { case (res, _) => res.pValue }
-        val selected = tempRes
-          .zipWithIndex
-          .filter { case ((res, _), index) =>
-            res.pValue <= getFdr * (index + 1) / testResult.length }
-        if (selected.isEmpty) {
-          Array.empty[(SelectionTestResult, Int)]
-        } else {
-          val maxIndex = selected.map(_._2).max
-          tempRes.take(maxIndex + 1)
-        }
-      case "fwe" =>
-        testResult
-          .filter { case (res, _) => res.pValue < getFwe / testResult.length }
-      case errorType =>
-        throw new IllegalStateException(s"Unknown Selector Type: $errorType")
-    }
-    val indices = features.map { case (_, index) => index }
-    copyValues(new ChiSqSelectorModel(uid, indices.sorted)
-      .setParent(this))
+  /**
+   * get the SelectionTestResult for every feature against the label
+   */
+  @Since("3.1.0")
 
 Review comment:
   protected method do not need since annotation

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-598514618
 
 
   **[Test build #119731 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119731/testReport)** for PR 27882 at commit [`1965424`](https://github.com/apache/spark/commit/1965424164d2c2ff7d4287ef3c88f2174f8a1dc2).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391396136
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
 ##########
 @@ -153,102 +48,75 @@ private[feature] trait ChiSqSelectorParams extends Params
  */
 @Since("1.6.0")
 final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: String)
-  extends Estimator[ChiSqSelectorModel] with ChiSqSelectorParams with DefaultParamsWritable {
+  extends Selector[ChiSqSelectorModel] {
 
   @Since("1.6.0")
   def this() = this(Identifiable.randomUID("chiSqSelector"))
 
   /** @group setParam */
   @Since("1.6.0")
-  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
+  override def setNumTopFeatures(value: Int): this.type = super.setNumTopFeatures(value)
 
   /** @group setParam */
   @Since("2.1.0")
-  def setPercentile(value: Double): this.type = set(percentile, value)
+  override def setPercentile(value: Double): this.type = super.setPercentile(value)
 
   /** @group setParam */
   @Since("2.1.0")
-  def setFpr(value: Double): this.type = set(fpr, value)
+  override def setFpr(value: Double): this.type = super.setFpr(value)
 
   /** @group setParam */
   @Since("2.2.0")
-  def setFdr(value: Double): this.type = set(fdr, value)
+  override def setFdr(value: Double): this.type = super.setFdr(value)
 
   /** @group setParam */
   @Since("2.2.0")
-  def setFwe(value: Double): this.type = set(fwe, value)
+  override def setFwe(value: Double): this.type = super.setFwe(value)
 
   /** @group setParam */
   @Since("2.1.0")
-  def setSelectorType(value: String): this.type = set(selectorType, value)
+  override def setSelectorType(value: String): this.type = super.setSelectorType(value)
 
   /** @group setParam */
   @Since("1.6.0")
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+  override def setFeaturesCol(value: String): this.type = super.setFeaturesCol(value)
 
   /** @group setParam */
   @Since("1.6.0")
-  def setOutputCol(value: String): this.type = set(outputCol, value)
+  override def setOutputCol(value: String): this.type = super.setOutputCol(value)
 
   /** @group setParam */
   @Since("1.6.0")
-  def setLabelCol(value: String): this.type = set(labelCol, value)
-
-  @Since("2.0.0")
-  override def fit(dataset: Dataset[_]): ChiSqSelectorModel = {
-    transformSchema(dataset.schema, logging = true)
-    dataset.select(col($(labelCol)).cast(DoubleType), col($(featuresCol))).rdd.map {
-      case Row(label: Double, features: Vector) =>
-        LabeledPoint(label, features)
-    }
+  override def setLabelCol(value: String): this.type = super.setLabelCol(value)
 
-    val testResult = ChiSquareTest.testChiSquare(dataset, getFeaturesCol, getLabelCol)
-      .zipWithIndex
-    val features = $(selectorType) match {
-      case "numTopFeatures" =>
-        testResult
-          .sortBy { case (res, _) => res.pValue }
-          .take(getNumTopFeatures)
-      case "percentile" =>
-        testResult
-          .sortBy { case (res, _) => res.pValue }
-          .take((testResult.length * getPercentile).toInt)
-      case "fpr" =>
-        testResult
-          .filter { case (res, _) => res.pValue < getFpr }
-      case "fdr" =>
-        // This uses the Benjamini-Hochberg procedure.
-        // https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure
-        val tempRes = testResult
-          .sortBy { case (res, _) => res.pValue }
-        val selected = tempRes
-          .zipWithIndex
-          .filter { case ((res, _), index) =>
-            res.pValue <= getFdr * (index + 1) / testResult.length }
-        if (selected.isEmpty) {
-          Array.empty[(SelectionTestResult, Int)]
-        } else {
-          val maxIndex = selected.map(_._2).max
-          tempRes.take(maxIndex + 1)
-        }
-      case "fwe" =>
-        testResult
-          .filter { case (res, _) => res.pValue < getFwe / testResult.length }
-      case errorType =>
-        throw new IllegalStateException(s"Unknown Selector Type: $errorType")
-    }
-    val indices = features.map { case (_, index) => index }
-    copyValues(new ChiSqSelectorModel(uid, indices.sorted)
-      .setParent(this))
+  /**
+   * get the SelectionTestResult for every feature against the label
+   */
+  @Since("3.1.0")
+  protected[this] override def getSelectionTestResult(dataset: Dataset[_]):
+  Array[SelectionTestResult] = {
+    SelectionTest.chiSquareTest(dataset, getFeaturesCol, getLabelCol)
   }
 
-  @Since("1.6.0")
-  override def transformSchema(schema: StructType): StructType = {
-    SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
-    SchemaUtils.checkNumericType(schema, $(labelCol))
-    SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
+  /**
+   * Create a new instance of concrete SelectorModel.
+   * @param indices The indices of the selected features
+   * @param pValues The pValues of the selected features
+   * @param statistics The chi square statistic of the selected features
+   * @return A new SelectorModel instance
+   */
+  @Since("3.1.0")
+  protected[this] def createSelectorModel(
 
 Review comment:
   this method is just a constructor, so I think we do not need it.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391332980
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/Selector.scala
 ##########
 @@ -0,0 +1,379 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.collection.mutable.ArrayBuilder
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml._
+import org.apache.spark.ml.attribute.{AttributeGroup, _}
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.stat.SelectionTestResult
+import org.apache.spark.ml.util._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+
+/**
+ * Params for [[Selector]] and [[SelectorModel]].
+ */
+private[feature] trait SelectorParams extends Params
+  with HasFeaturesCol with HasOutputCol with HasLabelCol {
+
+  /**
+   * Number of features that selector will select, ordered by ascending p-value. If the
+   * number of features is less than numTopFeatures, then this will select all features.
+   * Only applicable when selectorType = "numTopFeatures".
+   * The default value of numTopFeatures is 50.
+   *
+   * @group param
+   */
+  @Since("3.1.0")
+  final val numTopFeatures = new IntParam(this, "numTopFeatures",
+    "Number of features that selector will select, ordered by ascending p-value. If the" +
+      " number of features is < numTopFeatures, then this will select all features.",
+    ParamValidators.gtEq(1))
+  setDefault(numTopFeatures -> 50)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getNumTopFeatures: Int = $(numTopFeatures)
+
+  /**
+   * Percentile of features that selector will select, ordered by ascending p-value.
+   * Only applicable when selectorType = "percentile".
+   * Default value is 0.1.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val percentile = new DoubleParam(this, "percentile",
+    "Percentile of features that selector will select, ordered by ascending p-value.",
+    ParamValidators.inRange(0, 1))
+  setDefault(percentile -> 0.1)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getPercentile: Double = $(percentile)
+
+  /**
+   * The highest p-value for features to be kept.
+   * Only applicable when selectorType = "fpr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fpr = new DoubleParam(this, "fpr", "The higest p-value for features to be kept.",
+    ParamValidators.inRange(0, 1))
+  setDefault(fpr -> 0.05)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getFpr: Double = $(fpr)
+
+  /**
+   * The upper bound of the expected false discovery rate.
+   * Only applicable when selectorType = "fdr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fdr = new DoubleParam(this, "fdr",
+    "The upper bound of the expected false discovery rate.", ParamValidators.inRange(0, 1))
+  setDefault(fdr -> 0.05)
+
+  /** @group getParam */
+  def getFdr: Double = $(fdr)
+
+  /**
+   * The upper bound of the expected family-wise error rate.
+   * Only applicable when selectorType = "fwe".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fwe = new DoubleParam(this, "fwe",
+    "The upper bound of the expected family-wise error rate.", ParamValidators.inRange(0, 1))
+  setDefault(fwe -> 0.05)
+
+  /** @group getParam */
+  def getFwe: Double = $(fwe)
+
+  /**
+   * The selector type.
+   * Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe"
+   * @group param
+   */
+  @Since("3.1.0")
+  final val selectorType = new Param[String](this, "selectorType",
+    "The selector type. Supported options: numTopFeatures, percentile, fpr, fdr, fwe",
+    ParamValidators.inArray(Array("numTopFeatures", "percentile", "fpr", "fdr",
+      "fwe")))
+  setDefault(selectorType -> "numTopFeatures")
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getSelectorType: String = $(selectorType)
+}
+
+/**
+ * Super class for all the feature selectors. The following selectors are supported:
+ * 1. Chi-Square Selector
+ * This feature selector is for categorical features and categorical labels.
+ * 2. ANOVA F-value Classification Selector
+ * This feature selector is for continuous features and categorical labels.
+ * 3. Regression F-value Selector
+ * This feature selector is for continuous features and continuous labels.
+ * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
+ * `fdr`, `fwe`.
+ *  - `numTopFeatures` chooses a fixed number of top features according to a hypothesis.
+ *  - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
+ *  - `fpr` chooses all features whose p-value are below a threshold, thus controlling the false
+ *    positive rate of selection.
+ *  - `fdr` uses the [Benjamini-Hochberg procedure]
+ *    (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
+ *    to choose all features whose false discovery rate is below a threshold.
+ *  - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+ *    1/numFeatures, thus controlling the family-wise error rate of selection.
+ * By default, the selection method is `numTopFeatures`, with the default number of top features
+ * set to 50.
+ */
+@Since("3.1.0")
+private[ml] abstract class Selector[T <: SelectorModel[T]]
+  extends Estimator[T] with SelectorParams with DefaultParamsWritable {
+  self: Estimator[T] =>
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setPercentile(value: Double): this.type = set(percentile, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setFpr(value: Double): this.type = set(fpr, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setFdr(value: Double): this.type = set(fdr, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setFwe(value: Double): this.type = set(fwe, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setSelectorType(value: String): this.type = set(selectorType, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /** @group setParam */
+  @Since("3.1.0")
+  def setLabelCol(value: String): this.type = set(labelCol, value)
+
+  /**
+   * get the SelectionTestResult for every feature against the label
+   */
+  @Since("3.1.0")
+  protected[this] def getSelectionTestResult(dataset: Dataset[_]): Array[SelectionTestResult]
+
+  /**
+   * Create a new instance of concrete SelectorModel.
+   * @param indices The indices of the selected features
+   * @param pValues The pValues of the selected features
+   * @param statistics The statistics of the selected features
+   * @return A new SelectorModel instance
+   */
+  @Since("3.1.0")
+  protected[this] def createSelectorModel(
+      uid: String,
+      indices: Array[Int],
+      pValues: Array[Double],
+      statistics: Array[Double]): T
+
+  @Since("3.1.0")
+  override def fit(dataset: Dataset[_]): T = {
+    transformSchema(dataset.schema, logging = true)
+    val input: RDD[LabeledPoint] =
+      dataset.select(col($(labelCol)).cast(DoubleType), col($(featuresCol))).rdd.map {
+        case Row(label: Double, features: Vector) =>
+          LabeledPoint(label, features)
+      }
+
+    val testResult = getSelectionTestResult(dataset)
+      .zipWithIndex
+    val features = $(selectorType) match {
+      case "numTopFeatures" =>
+        testResult
+          .sortBy { case (res, _) => res.pValue }
+          .take(getNumTopFeatures)
+      case "percentile" =>
+        testResult
+          .sortBy { case (res, _) => res.pValue }
+          .take((testResult.length * getPercentile).toInt)
+      case "fpr" =>
+        testResult
+          .filter { case (res, _) => res.pValue < getFpr }
+      case "fdr" =>
+        // This uses the Benjamini-Hochberg procedure.
+        // https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure
+        val tempRes = testResult
+          .sortBy { case (res, _) => res.pValue }
+        val selected = tempRes
+          .zipWithIndex
+          .filter { case ((res, _), index) =>
+            res.pValue <= getFdr * (index + 1) / testResult.length }
+        if (selected.isEmpty) {
+          Array.empty[(SelectionTestResult, Int)]
+        } else {
+          val maxIndex = selected.map(_._2).max
+          tempRes.take(maxIndex + 1)
+        }
+      case "fwe" =>
+        testResult
+          .filter { case (res, _) => res.pValue < getFwe / testResult.length }
+      case errorType =>
+        throw new IllegalStateException(s"Unknown Selector Type: $errorType")
+    }
+    val indices = features.map { case (_, index) => index }
+    val pValues = features.map(_._1.pValue)
+    val statistic = features.map(_._1.statistic)
+    copyValues(createSelectorModel(uid, indices.sorted, pValues, statistic).setParent(this))
+  }
+
+  @Since("3.1.0")
+  override def transformSchema(schema: StructType): StructType = {
+    SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+    SchemaUtils.checkNumericType(schema, $(labelCol))
+    SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
+  }
+
+  @Since("3.1.0")
+  override def copy(extra: ParamMap): Selector[T] = defaultCopy(extra)
+}
+
+/**
+ * Model fitted by [[Selector]].
+ */
+@Since("3.1.0")
+private[ml] abstract class SelectorModel[T <: SelectorModel[T]] (
+    @Since("3.1.0") val uid: String,
+    @Since("3.1.0") val selectedFeatures: Array[Int],
+    @Since("3.1.0") val pValues: Array[Double],
+    @Since("3.1.0") val statistic: Array[Double])
 
 Review comment:
   ```pValues``` and ```statistic``` are useful information. I will put these in model. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599289509
 
 
   **[Test build #119823 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119823/testReport)** for PR 27882 at commit [`b219f0a`](https://github.com/apache/spark/commit/b219f0af6713c77b162db50b72f0b42d2c420818).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599302165
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119821/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27882: [WIP][SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-599326224
 
 
   **[Test build #119823 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119823/testReport)** for PR 27882 at commit [`b219f0a`](https://github.com/apache/spark/commit/b219f0af6713c77b162db50b72f0b42d2c420818).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#issuecomment-598477273
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #27882: [SPARK-31127][ML] Add abstract Selector
URL: https://github.com/apache/spark/pull/27882#discussion_r391957230
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/Selector.scala
 ##########
 @@ -0,0 +1,379 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.collection.mutable.ArrayBuilder
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml._
+import org.apache.spark.ml.attribute.{AttributeGroup, _}
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.stat.SelectionTestResult
+import org.apache.spark.ml.util._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+
+/**
+ * Params for [[Selector]] and [[SelectorModel]].
+ */
+private[feature] trait SelectorParams extends Params
+  with HasFeaturesCol with HasOutputCol with HasLabelCol {
+
+  /**
+   * Number of features that selector will select, ordered by ascending p-value. If the
+   * number of features is less than numTopFeatures, then this will select all features.
+   * Only applicable when selectorType = "numTopFeatures".
+   * The default value of numTopFeatures is 50.
+   *
+   * @group param
+   */
+  @Since("3.1.0")
+  final val numTopFeatures = new IntParam(this, "numTopFeatures",
+    "Number of features that selector will select, ordered by ascending p-value. If the" +
+      " number of features is < numTopFeatures, then this will select all features.",
+    ParamValidators.gtEq(1))
+  setDefault(numTopFeatures -> 50)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getNumTopFeatures: Int = $(numTopFeatures)
+
+  /**
+   * Percentile of features that selector will select, ordered by ascending p-value.
+   * Only applicable when selectorType = "percentile".
+   * Default value is 0.1.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val percentile = new DoubleParam(this, "percentile",
+    "Percentile of features that selector will select, ordered by ascending p-value.",
+    ParamValidators.inRange(0, 1))
+  setDefault(percentile -> 0.1)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getPercentile: Double = $(percentile)
+
+  /**
+   * The highest p-value for features to be kept.
+   * Only applicable when selectorType = "fpr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fpr = new DoubleParam(this, "fpr", "The higest p-value for features to be kept.",
+    ParamValidators.inRange(0, 1))
+  setDefault(fpr -> 0.05)
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getFpr: Double = $(fpr)
+
+  /**
+   * The upper bound of the expected false discovery rate.
+   * Only applicable when selectorType = "fdr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fdr = new DoubleParam(this, "fdr",
+    "The upper bound of the expected false discovery rate.", ParamValidators.inRange(0, 1))
+  setDefault(fdr -> 0.05)
+
+  /** @group getParam */
+  def getFdr: Double = $(fdr)
+
+  /**
+   * The upper bound of the expected family-wise error rate.
+   * Only applicable when selectorType = "fwe".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("3.1.0")
+  final val fwe = new DoubleParam(this, "fwe",
+    "The upper bound of the expected family-wise error rate.", ParamValidators.inRange(0, 1))
+  setDefault(fwe -> 0.05)
+
+  /** @group getParam */
+  def getFwe: Double = $(fwe)
+
+  /**
+   * The selector type.
+   * Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe"
+   * @group param
+   */
+  @Since("3.1.0")
+  final val selectorType = new Param[String](this, "selectorType",
+    "The selector type. Supported options: numTopFeatures, percentile, fpr, fdr, fwe",
+    ParamValidators.inArray(Array("numTopFeatures", "percentile", "fpr", "fdr",
+      "fwe")))
+  setDefault(selectorType -> "numTopFeatures")
+
+  /** @group getParam */
+  @Since("3.1.0")
+  def getSelectorType: String = $(selectorType)
+}
+
+/**
+ * Super class for all the feature selectors. The following selectors are supported:
+ * 1. Chi-Square Selector
+ * This feature selector is for categorical features and categorical labels.
+ * 2. ANOVA F-value Classification Selector
+ * This feature selector is for continuous features and categorical labels.
+ * 3. Regression F-value Selector
+ * This feature selector is for continuous features and continuous labels.
+ * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
+ * `fdr`, `fwe`.
+ *  - `numTopFeatures` chooses a fixed number of top features according to a hypothesis.
+ *  - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
+ *  - `fpr` chooses all features whose p-value are below a threshold, thus controlling the false
+ *    positive rate of selection.
+ *  - `fdr` uses the [Benjamini-Hochberg procedure]
+ *    (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
+ *    to choose all features whose false discovery rate is below a threshold.
+ *  - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+ *    1/numFeatures, thus controlling the family-wise error rate of selection.
+ * By default, the selection method is `numTopFeatures`, with the default number of top features
+ * set to 50.
+ */
+@Since("3.1.0")
+private[ml] abstract class Selector[T <: SelectorModel[T]]
+  extends Estimator[T] with SelectorParams with DefaultParamsWritable {
+  self: Estimator[T] =>
 
 Review comment:
   Will delete

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org