Posted to reviews@spark.apache.org by yinxusen <gi...@git.apache.org> on 2015/04/28 10:09:15 UTC

[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

GitHub user yinxusen opened a pull request:

    https://github.com/apache/spark/pull/5742

    [SPARK-6530][ML] Add chi-square selector for ml package

    See JIRA [here](https://issues.apache.org/jira/browse/SPARK-6530).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yinxusen/spark SPARK-6530

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5742.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5742
    
----
commit 92fef9edaffff692d17a49685cb32b5947e17373
Author: Xusen Yin <yi...@gmail.com>
Date:   2015-04-28T08:05:57Z

    add chi-square selector for ml package

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-142497972
  
    Merged build started.




[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-142499651
  
      [Test build #42888 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42888/console) for   PR 5742 at commit [`a7d983f`](https://github.com/apache/spark/commit/a7d983f38e26e20a690577dd158d4d42f1fd5682).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `final class ChiSqSelector(override val uid: String)`





[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-142499670
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42888/




[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-139070531
  
    @yinxusen Checking back for updates.  Thanks!




[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/5742




[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5742#discussion_r40964421
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala ---
    @@ -0,0 +1,61 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.mllib.util.TestingUtils._
    +import org.apache.spark.sql.{Row, SQLContext}
    +
    +class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext {
    +  test("Test Chi-Square selector") {
    +    val sqlContext = new SQLContext(sc)
    +    import sqlContext.implicits._
    +
    +    val data = Seq(
    +      LabeledPoint(0.0, Vectors.sparse(3, Array((0, 8.0), (1, 7.0)))),
    +      LabeledPoint(1.0, Vectors.sparse(3, Array((1, 9.0), (2, 6.0)))),
    +      LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 8.0))),
    +      LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0)))
    +    )
    +
    +    val preFilteredData = Seq(
    +      Vectors.dense(0.0),
    +      Vectors.dense(6.0),
    +      Vectors.dense(8.0),
    +      Vectors.dense(5.0)
    +    )
    +
    +    val df = sc.parallelize(data.zip(preFilteredData))
    +      .map(x => (x._1.label, x._1.features, x._2))
    +      .toDF("label", "data", "preFilteredData")
    +
    +    val model = new ChiSqSelector()
    +      .setNumTopFeatures(1)
    +      .setFeaturesCol("data")
    +      .setLabelCol("label")
    +      .setOutputCol("expected")
    --- End diff --
    
    rename: "expected" --> "filtered"




[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-144844042
  
    Done with pass.  (and back from traveling, so will review faster now!)




[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-142497960
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-142358242
  
    @yinxusen If you won't be able to keep working on this, please let me know.  Someone else or I can take over.  Thanks.




[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-142498181
  
      [Test build #42888 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42888/consoleFull) for   PR 5742 at commit [`a7d983f`](https://github.com/apache/spark/commit/a7d983f38e26e20a690577dd158d4d42f1fd5682).




[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-144900359
  
      [Test build #43166 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43166/console) for   PR 5742 at commit [`f552028`](https://github.com/apache/spark/commit/f552028db9c18cf850ed934942d208692ec26ab9).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `final class ChiSqSelector(override val uid: String)`





[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5742#discussion_r40964419
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala ---
    @@ -0,0 +1,61 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.mllib.util.TestingUtils._
    +import org.apache.spark.sql.{Row, SQLContext}
    +
    +class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext {
    +  test("Test Chi-Square selector") {
    +    val sqlContext = new SQLContext(sc)
    --- End diff --
    
    SQLContext.getOrCreate




[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5742#discussion_r40964378
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,145 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.ml._
    +import org.apache.spark.ml.attribute.{AttributeGroup, _}
    +import org.apache.spark.ml.param._
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util.Identifiable
    +import org.apache.spark.ml.util.SchemaUtils
    +import org.apache.spark.mllib.feature
    +import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.sql._
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
    +
    +/**
    + * Params for [[ChiSqSelector]] and [[ChiSqSelectorModel]].
    + */
    +private[feature] trait ChiSqSelectorParams extends Params
    +  with HasFeaturesCol with HasOutputCol with HasLabelCol {
    +
    +  /**
    +   * Number of features that selector will select (ordered by statistic value descending). If the
    +   * number of features is < numTopFeatures, then this will select all features. The default value
    +   * of numTopFeatures is 50.
    +   * @group param
    +   */
    +  final val numTopFeatures = new IntParam(this, "numTopFeatures",
    +    "Number of features that selector will select, ordered by statistics value descending. If the" +
    +      " number of features is < numTopFeatures, then this will select all features.",
    +    ParamValidators.gtEq(1))
    +  setDefault(numTopFeatures -> 50)
    +
    +  /** @group getParam */
    +  def getNumTopFeatures: Int = $(numTopFeatures)
    +}
    +
    +/**
    + * :: Experimental ::
    + * Compute the Chi-Square selector model given an `RDD` of `LabeledPoint` data.
    --- End diff --
    
    Update doc: "Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label."




[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5742#discussion_r40964416
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,145 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.ml._
    +import org.apache.spark.ml.attribute.{AttributeGroup, _}
    +import org.apache.spark.ml.param._
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util.Identifiable
    +import org.apache.spark.ml.util.SchemaUtils
    +import org.apache.spark.mllib.feature
    +import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.sql._
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
    +
    +/**
    + * Params for [[ChiSqSelector]] and [[ChiSqSelectorModel]].
    + */
    +private[feature] trait ChiSqSelectorParams extends Params
    +  with HasFeaturesCol with HasOutputCol with HasLabelCol {
    +
    +  /**
    +   * Number of features that selector will select (ordered by statistic value descending). If the
    +   * number of features is < numTopFeatures, then this will select all features. The default value
    +   * of numTopFeatures is 50.
    +   * @group param
    +   */
    +  final val numTopFeatures = new IntParam(this, "numTopFeatures",
    +    "Number of features that selector will select, ordered by statistics value descending. If the" +
    +      " number of features is < numTopFeatures, then this will select all features.",
    +    ParamValidators.gtEq(1))
    +  setDefault(numTopFeatures -> 50)
    +
    +  /** @group getParam */
    +  def getNumTopFeatures: Int = $(numTopFeatures)
    +}
    +
    +/**
    + * :: Experimental ::
    + * Compute the Chi-Square selector model given an `RDD` of `LabeledPoint` data.
    + */
    +@Experimental
    +final class ChiSqSelector(override val uid: String)
    +  extends Estimator[ChiSqSelectorModel] with ChiSqSelectorParams {
    +
    +  def this() = this(Identifiable.randomUID("chiSqSelector"))
    +
    +  /** @group setParam */
    +  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /** @group setParam */
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  /** @group setParam */
    +  def setLabelCol(value: String): this.type = set(labelCol, value)
    +
    +  override def fit(dataset: DataFrame): ChiSqSelectorModel = {
    +    transformSchema(dataset.schema, logging = true)
    +    val input = dataset.select($(labelCol), $(featuresCol)).map {
    +      case Row(label: Double, features: Vector) =>
    +        LabeledPoint(label, features)
    +    }
    +    val chiSqSelector = new feature.ChiSqSelector($(numTopFeatures)).fit(input)
    +    copyValues(new ChiSqSelectorModel(uid, chiSqSelector).setParent(this))
    +  }
    +
    +  override def transformSchema(schema: StructType): StructType = {
    +    SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(schema, $(labelCol), DoubleType)
    +    SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
    +  }
    +
    +  override def copy(extra: ParamMap): ChiSqSelector = defaultCopy(extra)
    +}
    +
    +/**
    + * :: Experimental ::
    + * Model fitted by [[ChiSqSelector]].
    + */
    +@Experimental
    +final class ChiSqSelectorModel private[ml] (
    +    override val uid: String,
    +    private val chiSqSelector: feature.ChiSqSelectorModel)
    +  extends Model[ChiSqSelectorModel] with ChiSqSelectorParams {
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /** @group setParam */
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: DataFrame): DataFrame = {
    +    transformSchema(dataset.schema, logging = true)
    +    val newField = prepOutputField(dataset.schema)
    +    val selector = udf { chiSqSelector.transform _ }
    +    dataset.withColumn($(outputCol), selector(col($(featuresCol))), newField.metadata)
    +  }
    +
    +  override def transformSchema(schema: StructType): StructType = {
    +    SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
    +    val newField = prepOutputField(schema)
    +    val outputFields = schema.fields :+ newField
    +    StructType(outputFields)
    +  }
    +
    +  /**
    +   * Prepare the output column field, including per-feature metadata.
    +   */
    +  private def prepOutputField(schema: StructType): StructField = {
    +    val selector = chiSqSelector.selectedFeatures.toSet
    +    val origAttrGroup = AttributeGroup.fromStructField(schema($(featuresCol)))
    +    val featureAttributes: Array[Attribute] = if (origAttrGroup.attributes.nonEmpty) {
    +      origAttrGroup.attributes.get.zipWithIndex.filter(x => selector.contains(x._2)).map(_._1)
    +    } else {
    +      Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
    +    }
    +    val newAttributeGroup = new AttributeGroup($(outputCol), featureAttributes)
    +    newAttributeGroup.toStructField()
    +  }
    +
    +  override def copy(extra: ParamMap): ChiSqSelectorModel = {
    +    defaultCopy[ChiSqSelectorModel](extra).setParent(parent)
    --- End diff --
    
    This won't set the chiSqSelector field.  You'll need to construct a new instance manually; see StandardScalerModel for example.
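
The manual-copy pattern this comment asks for can be sketched in plain Scala. This is only an illustration under stated assumptions: `SelectorState` and `SketchModel` are hypothetical stand-ins, not Spark classes, and the `Map`-based params only imitate Spark's `Params` storage. The point is that state passed through the constructor has to be handed to the new instance explicitly, after which the params are carried over.

```scala
// Hypothetical stand-ins for illustration; these are not Spark classes.
final class SelectorState(val selectedFeatures: Array[Int])

final class SketchModel(val uid: String, private val state: SelectorState) {
  // Immutable map held in a var, standing in for Spark's param storage.
  private var params: Map[String, String] = Map.empty

  def set(name: String, value: String): this.type = {
    params += (name -> value)
    this
  }

  def get(name: String): Option[String] = params.get(name)

  def selected: Seq[Int] = state.selectedFeatures.toSeq

  // Manual copy: rebuild the instance with the constructor-held state,
  // then carry over the existing params and apply any overrides.
  def copy(extra: Map[String, String]): SketchModel = {
    val copied = new SketchModel(uid, state)
    copied.params = params ++ extra
    copied
  }
}
```

In the PR's own code the analogous change would presumably construct `new ChiSqSelectorModel(uid, chiSqSelector)` inside `copy` and then copy the param values onto it, as the comment says `StandardScalerModel` does.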




[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-143142324
  
    @jkbradley This test error is not caused by the code, pls retest it when possible. :)




[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5742#discussion_r40964408
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,145 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.ml._
    +import org.apache.spark.ml.attribute.{AttributeGroup, _}
    +import org.apache.spark.ml.param._
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util.Identifiable
    +import org.apache.spark.ml.util.SchemaUtils
    +import org.apache.spark.mllib.feature
    +import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.sql._
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
    +
    +/**
    + * Params for [[ChiSqSelector]] and [[ChiSqSelectorModel]].
    + */
    +private[feature] trait ChiSqSelectorParams extends Params
    +  with HasFeaturesCol with HasOutputCol with HasLabelCol {
    +
    +  /**
    +   * Number of features that selector will select (ordered by statistic value descending). If the
    +   * number of features is < numTopFeatures, then this will select all features. The default value
    +   * of numTopFeatures is 50.
    +   * @group param
    +   */
    +  final val numTopFeatures = new IntParam(this, "numTopFeatures",
    +    "Number of features that selector will select, ordered by statistics value descending. If the" +
    +      " number of features is < numTopFeatures, then this will select all features.",
    +    ParamValidators.gtEq(1))
    +  setDefault(numTopFeatures -> 50)
    +
    +  /** @group getParam */
    +  def getNumTopFeatures: Int = $(numTopFeatures)
    +}
    +
    +/**
    + * :: Experimental ::
    + * Compute the Chi-Square selector model given an `RDD` of `LabeledPoint` data.
    + */
    +@Experimental
    +final class ChiSqSelector(override val uid: String)
    +  extends Estimator[ChiSqSelectorModel] with ChiSqSelectorParams {
    +
    +  def this() = this(Identifiable.randomUID("chiSqSelector"))
    +
    +  /** @group setParam */
    +  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /** @group setParam */
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  /** @group setParam */
    +  def setLabelCol(value: String): this.type = set(labelCol, value)
    +
    +  override def fit(dataset: DataFrame): ChiSqSelectorModel = {
    +    transformSchema(dataset.schema, logging = true)
    +    val input = dataset.select($(labelCol), $(featuresCol)).map {
    +      case Row(label: Double, features: Vector) =>
    +        LabeledPoint(label, features)
    +    }
    +    val chiSqSelector = new feature.ChiSqSelector($(numTopFeatures)).fit(input)
    +    copyValues(new ChiSqSelectorModel(uid, chiSqSelector).setParent(this))
    +  }
    +
    +  override def transformSchema(schema: StructType): StructType = {
    +    SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(schema, $(labelCol), DoubleType)
    +    SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
    +  }
    +
    +  override def copy(extra: ParamMap): ChiSqSelector = defaultCopy(extra)
    +}
    +
    +/**
    + * :: Experimental ::
    + * Model fitted by [[ChiSqSelector]].
    + */
    +@Experimental
    +final class ChiSqSelectorModel private[ml] (
    +    override val uid: String,
    +    private val chiSqSelector: feature.ChiSqSelectorModel)
    +  extends Model[ChiSqSelectorModel] with ChiSqSelectorParams {
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /** @group setParam */
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: DataFrame): DataFrame = {
    +    transformSchema(dataset.schema, logging = true)
    +    val newField = prepOutputField(dataset.schema)
    --- End diff --
    
    You could get this field from the result of transformSchema, rather than recomputing it.
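
One way to read this suggestion, as a minimal sketch: let `transformSchema` build the output field once, and have `transform` look that field up by name in the returned schema instead of calling `prepOutputField` a second time. To stay runnable without Spark, the sketch uses hypothetical stand-ins (`Field`, `Schema`, `ReuseTransformSchema`, and the column name `selected`) rather than Spark's `StructField` and `StructType`. With the real classes the lookup would presumably be `transformSchema(dataset.schema, logging = true)($(outputCol))`, assuming `transformSchema` attaches the same metadata that `prepOutputField` produces.

```scala
// Stand-in types (hypothetical, not Spark's classes) so the sketch runs without Spark.
final case class Field(name: String, dataType: String)

final case class Schema(fields: Vector[Field]) {
  // Look a field up by name, similar in spirit to StructType.apply(name).
  def apply(name: String): Field =
    fields.find(_.name == name).getOrElse(
      throw new IllegalArgumentException(s"Field $name does not exist."))
}

object ReuseTransformSchema {
  // Assumed output column name, for illustration only.
  val outputCol = "selected"

  // Validates the input and appends the output column, like transformSchema in the diff.
  def transformSchema(schema: Schema): Schema = {
    require(schema.fields.exists(_.name == "features"), "features column missing")
    Schema(schema.fields :+ Field(outputCol, "vector"))
  }

  // Read the new field back from the validated schema instead of rebuilding it.
  def outputField(input: Schema): Field = transformSchema(input)(outputCol)
}
```

The point of the design is that the output field is defined in exactly one place, so the schema reported by `transformSchema` and the column actually produced by `transform` cannot drift apart.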


[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-142370887
  
    @jkbradley I'll keep working on it.


[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-96992953
  
      [Test build #31134 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31134/consoleFull) for PR 5742 at commit [`92fef9e`](https://github.com/apache/spark/commit/92fef9edaffff692d17a49685cb32b5947e17373).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `final class ChiSqSelector extends Estimator[ChiSqSelectorModel] with ChiSqSelectorBase `
    
     * This patch does not change any dependencies.


[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-144900648
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-144896542
  
     Merged build triggered.


[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5742#discussion_r40964382
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,145 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.ml._
    +import org.apache.spark.ml.attribute.{AttributeGroup, _}
    +import org.apache.spark.ml.param._
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util.Identifiable
    +import org.apache.spark.ml.util.SchemaUtils
    +import org.apache.spark.mllib.feature
    +import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.sql._
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
    +
    +/**
    + * Params for [[ChiSqSelector]] and [[ChiSqSelectorModel]].
    + */
    +private[feature] trait ChiSqSelectorParams extends Params
    +  with HasFeaturesCol with HasOutputCol with HasLabelCol {
    +
    +  /**
    +   * Number of features that selector will select (ordered by statistic value descending). If the
    +   * number of features is < numTopFeatures, then this will select all features. The default value
    +   * of numTopFeatures is 50.
    +   * @group param
    +   */
    +  final val numTopFeatures = new IntParam(this, "numTopFeatures",
    +    "Number of features that selector will select, ordered by statistics value descending. If the" +
    +      " number of features is < numTopFeatures, then this will select all features.",
    +    ParamValidators.gtEq(1))
    +  setDefault(numTopFeatures -> 50)
    +
    +  /** @group getParam */
    +  def getNumTopFeatures: Int = $(numTopFeatures)
    +}
    +
    +/**
    + * :: Experimental ::
    + * Compute the Chi-Square selector model given an `RDD` of `LabeledPoint` data.
    + */
    +@Experimental
    +final class ChiSqSelector(override val uid: String)
    +  extends Estimator[ChiSqSelectorModel] with ChiSqSelectorParams {
    +
    +  def this() = this(Identifiable.randomUID("chiSqSelector"))
    +
    +  /** @group setParam */
    +  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /** @group setParam */
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  /** @group setParam */
    +  def setLabelCol(value: String): this.type = set(labelCol, value)
    +
    +  override def fit(dataset: DataFrame): ChiSqSelectorModel = {
    +    transformSchema(dataset.schema, logging = true)
    +    val input = dataset.select($(labelCol), $(featuresCol)).map {
    +      case Row(label: Double, features: Vector) =>
    +        LabeledPoint(label, features)
    +    }
    +    val chiSqSelector = new feature.ChiSqSelector($(numTopFeatures)).fit(input)
    +    copyValues(new ChiSqSelectorModel(uid, chiSqSelector).setParent(this))
    +  }
    +
    +  override def transformSchema(schema: StructType): StructType = {
    +    SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(schema, $(labelCol), DoubleType)
    +    SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
    +  }
    +
    +  override def copy(extra: ParamMap): ChiSqSelector = defaultCopy(extra)
    +}
    +
    +/**
    + * :: Experimental ::
    + * Model fitted by [[ChiSqSelector]].
    + */
    +@Experimental
    +final class ChiSqSelectorModel private[ml] (
    +    override val uid: String,
    +    private val chiSqSelector: feature.ChiSqSelectorModel)
    +  extends Model[ChiSqSelectorModel] with ChiSqSelectorParams {
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    --- End diff --
    
    Add setLabelCol as well.
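
A minimal sketch of what the request amounts to: the model would expose the same three column setters as the estimator. The `ModelParams` and `SelectorModelSketch` classes below are hypothetical stand-ins for Spark's `Params` machinery and for `ChiSqSelectorModel`, kept deliberately small so the snippet runs without Spark. In the real class the added line would presumably be `def setLabelCol(value: String): this.type = set(labelCol, value)`, mirroring the estimator's setter shown in the diff.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a params holder; Spark's real Params trait is richer.
class ModelParams {
  // Assumed default column names, for illustration only.
  private val values = mutable.Map("featuresCol" -> "features", "labelCol" -> "label")

  // Store a value and return this.type so that setters can be chained.
  protected def set(name: String, value: String): this.type = {
    values(name) = value
    this
  }

  def get(name: String): String = values(name)
}

final class SelectorModelSketch extends ModelParams {
  def setFeaturesCol(value: String): this.type = set("featuresCol", value)

  def setOutputCol(value: String): this.type = set("outputCol", value)

  // The setter the review asks for, mirroring the estimator's setLabelCol.
  def setLabelCol(value: String): this.type = set("labelCol", value)
}
```

Returning `this.type` keeps the chaining style used elsewhere in the diff, for example `new SelectorModelSketch().setLabelCol("target").setOutputCol("selected")`.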


[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-144896555
  
    Merged build started.


[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-144900651
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43166/


[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-144839607
  
      [Test build #1835 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1835/consoleFull) for PR 5742 at commit [`a7d983f`](https://github.com/apache/spark/commit/a7d983f38e26e20a690577dd158d4d42f1fd5682).


[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-144851924
  
      [Test build #1835 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1835/console) for PR 5742 at commit [`a7d983f`](https://github.com/apache/spark/commit/a7d983f38e26e20a690577dd158d4d42f1fd5682).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `final class ChiSqSelector(override val uid: String)`



[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-142499667
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-144897165
  
      [Test build #43166 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43166/consoleFull) for PR 5742 at commit [`f552028`](https://github.com/apache/spark/commit/f552028db9c18cf850ed934942d208692ec26ab9).


[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-145091810
  
    LGTM, merging with master.  Thank you!


[GitHub] spark pull request: [SPARK-6530][ML] Add chi-square selector for m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5742#issuecomment-96968693
  
      [Test build #31134 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31134/consoleFull) for PR 5742 at commit [`92fef9e`](https://github.com/apache/spark/commit/92fef9edaffff692d17a49685cb32b5947e17373).

