You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by avulanov <gi...@git.apache.org> on 2014/07/18 13:35:26 UTC

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

GitHub user avulanov opened a pull request:

    https://github.com/apache/spark/pull/1484

    [MLLIB] [WIP] SPARK-1473: Feature selection for high dimensional datasets

    The following is implemented: 
    1) generic traits for feature selection and filtering
    2) trait for feature selection of LabeledPoint with discrete data
    3) traits for calculation of contingency table and chi squared
    4) class for chi-squared feature selection
    5) tests for the above
    
    Needs some optimization in matrix operations.
    
    This request is a try to implement feature selection for MLLIB, the previous work by the issue author @izendejas was not finished (https://issues.apache.org/jira/browse/SPARK-1473). This request is also related to data discretization issues: https://issues.apache.org/jira/browse/SPARK-1303 and https://issues.apache.org/jira/browse/SPARK-1216 that weren't merged.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/avulanov/spark featureselection

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1484.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1484
    
----
commit 560dc08d7e2cbc191016a3ebbec1eb8146630bc7
Author: Alexander Ulanov <na...@yandex.ru>
Date:   2014-07-08T08:25:57Z

    Chi Squared feature selection: initial version

commit 6a35bcf64ff9c71445dc48f8299f8f78a5e324d5
Author: Alexander Ulanov <na...@yandex.ru>
Date:   2014-07-08T09:43:27Z

    Code style

commit dfb09fbf2732682d0b86afcbe02eb097e7d9c09e
Author: Alexander Ulanov <na...@yandex.ru>
Date:   2014-07-09T10:06:54Z

    Feature selection filter

commit fa5fd1119c6cc0e2c48a74baed89d32c5a1b5a58
Author: Alexander Ulanov <na...@yandex.ru>
Date:   2014-07-09T15:55:07Z

    Traits for FeatureSelection, CombinationsCalculator and FeatureFilter

commit 9a8f968ef07ee9a3cfc372d6e4d335d45ef5c065
Author: Alexander Ulanov <na...@yandex.ru>
Date:   2014-07-11T09:14:29Z

    Feature selection redesign with vigdorchik

commit 099fb135e159407ae9acf0a1dcbaf23fbc5e781a
Author: Alexander Ulanov <na...@yandex.ru>
Date:   2014-07-11T16:04:36Z

    Feature selector, fix of lazyness

commit 774b5ca9d4155315b388aae12e58d32b90c479fe
Author: Alexander Ulanov <na...@yandex.ru>
Date:   2014-07-14T16:52:28Z

    Combinations and chi-squared values test

commit 43a1169687db70ea52753e7e86eccb55ed0bf43e
Author: Alexander Ulanov <na...@yandex.ru>
Date:   2014-07-17T15:20:31Z

    Chi Squared by contingency table. Refactoring

commit 6890617e47f03278d08ea17929adf74dfa668230
Author: Alexander Ulanov <na...@yandex.ru>
Date:   2014-07-18T09:41:05Z

    Scala style fix

commit 2565f6d9a24c892a6ed28ec1174b5a7077fd8c77
Author: Alexander Ulanov <na...@yandex.ru>
Date:   2014-07-18T11:10:48Z

    Tests, comments, apache headers and scala style

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72300660
  
      [Test build #26451 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26451/consoleFull) for   PR 1484 at commit [`a6ad82a`](https://github.com/apache/spark/commit/a6ad82af0d81cfcba022fcaa1831153a1f2c36d9).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-51024568
  
    Sure. We had some transformers implemented under `mllib.feature`, similar to sk-learn's approach. For feature selection, we can follow the same approach if we view feature selection as transformation: 1) fit a dataset and select a subset of features, 2) transform a dataset by picking out selected features. So for the API, I suggest the following
    
    ~~~
    class ChiSquaredFeatureSelector(numFeatures: Int) extends Serializable {
      def fit(dataset: RDD[LabeledPoint]): this.type
      def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
    } 
    ~~~
    
    and we can hide the implementation from public interfaces. Please let me know whether this sounds good to you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-69372636
  
      [Test build #25329 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25329/consoleFull) for   PR 1484 at commit [`57cbb48`](https://github.com/apache/spark/commit/57cbb489a3651e2f3a17abf5e57dbb05650e6af7).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72528193
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26524/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-62975462
  
      [Test build #23329 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23329/consoleFull) for   PR 1484 at commit [`e972e07`](https://github.com/apache/spark/commit/e972e076ce1f96d9d437b6c759400c9f176f8d56).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23887525
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +import scala.collection.mutable.ArrayBuilder
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter). Must be ordered asc
    + */
    +@Experimental
    +class ChiSqSelectorModel  private[mllib] (indices: Array[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: Vector): Vector = {
    +    compress(vector, indices)
    +  }
    +
    +  /**
    +   * Returns a vector with features filtered.
    +   * Preserves the order of filtered features the same as their indices are stored.
    +   * Might be moved to Vector as .slice
    +   * @param features vector
    +   * @param filterIndices indices of features to filter, must be ordered asc
    +   */
    +  private def compress(features: Vector, filterIndices: Array[Int]): Vector = {
    +    features match {
    +      case SparseVector(size, indices, values) =>
    +        val newSize = filterIndices.length
    +        val newValues = new ArrayBuilder.ofDouble
    +        val newIndices = new ArrayBuilder.ofInt
    +        var i: Int = 0
    +        var j: Int = 0
    +        while(i < indices.length && j < filterIndices.length) {
    +          if(indices(i) == filterIndices(j)) {
    +            newIndices += j
    +            newValues += values(i)
    +            j += 1
    +            i += 1
    +          } else {
    +            if(indices(i) > filterIndices(j)) {
    +              j += 1
    +            } else {
    +              i += 1
    +            }
    +          }
    +        }
    +        /** Sparse representation might be ineffective if (newSize ~= newValues.size) */
    +        Vectors.sparse(newSize, newIndices.result(), newValues.result())
    +      case DenseVector(values) =>
    +        val values = features.toArray
    +        Vectors.dense(filterIndices.map(i => values(i)))
    +      case other =>
    +        throw new UnsupportedOperationException(
    +          s"Only sparse and dense vectors are supported but got ${other.getClass}.")
    +    }
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Creates a ChiSquared feature selector.
    + * @param numTopFeatures number of features that selector will select
    + *                       (ordered by statistic value descending)
    + */
    +@Experimental
    +class ChiSqSelector (val numTopFeatures: Int) {
    +
    +  /**
    +   * Returns a ChiSquared feature selector.
    +   *
    +   * @param data data used to compute the Chi Squared statistic.
    --- End diff --
    
    It might be worth noting that the values are factors, as in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala#L160


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-69383220
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25329/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72281109
  
    @avulanov Please check my inline comments on Compress and the Estimator/Model.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72516656
  
    @mengxr Thank you for your comments! Done! Do you have any plans to add feature discretization capabilities to MLlib? There are few links in the head of this thread.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-51086836
  
    @avulanov I have the same concern about calling `transform` before `fit`. There are two options: 1) throw an error, 2) fit on the same dataset and then transform (fit_transform in sk-learn). But I don't have a strong preference of either one.
    
    I want to add another candidate to what you proposed:
    
    ~~~
    class ChiSquaredFeatureSelection {
       def fit(dataset: RDD[LabeledPoint], numFeatures: Int): ChiSquaredFeatureSelector
    }
    
    class ChiSquaredFeatureSelector {
      def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
    }
    ~~~
    
    We can discuss the class hierarchy later since they are not user-facing.
    
    A problem with all the candidates here is we cannot apply the same transformation on `RDD[Vector]`, which is required for prediction. I'm thinking about something like the following:
    
    ~~~
    class ChiSquaredFeatureSelection {
       def fit[T <: Vectorized with Labeled](dataset: RDD[T], numFeatures: Int): ChiSquaredFeatureSelector
    }
    
    class ChiSquaredFeatureSelector {
      def transform[T <: Vectorized](dataset: RDD[T]): RDD[T]
    }
    ~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23672436
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,86 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter)
    + */
    +@Experimental
    +class ChiSqSelectorModel(indices: IndexedSeq[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: linalg.Vector): linalg.Vector = {
    +    Compress(vector, indices)
    --- End diff --
    
    Since `Compress` is only used in `ChiSqSelector`, shall we move its implementation here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23887530
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala ---
    @@ -0,0 +1,69 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.scalatest.FunSuite
    +
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +
    +class ChiSqSelectorSuite extends FunSuite with MLlibTestSparkContext {
    +
    +  /*
    +   *  Contingency tables
    +   *  feature0 = {8.0, 0.0}
    +   *  class  0 1 2
    +   *    8.0||1|0|1|
    +   *    0.0||0|2|0|
    +   *
    +   *  feature1 = {7.0, 9.0}
    +   *  class  0 1 2
    +   *    7.0||1|0|0|
    +   *    9.0||0|2|1|
    +   *
    +   *  feature2 = {0.0, 6.0, 8.0, 5.0}
    +   *  class  0 1 2
    +   *    0.0||1|0|0|
    +   *    6.0||0|1|0|
    +   *    8.0||0|1|0|
    +   *    5.0||0|0|1|
    +   *
    +   *  Use chi-squared calculator from Internet
    +   */
    +
    +  test("ChiSqSelector transform test (sparse & dense vector)") {
    +    val labeledDiscreteData = sc.parallelize(
    +      Seq(new LabeledPoint(0.0, Vectors.sparse(3, Array((0, 8.0), (1, 7.0)))),
    +        new LabeledPoint(1.0, Vectors.sparse(3, Array((1, 9.0), (2, 6.0)))),
    +        new LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 8.0))),
    +        new LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0)))
    +      ), 2)
    +    val preFilteredData =
    +      Set(new LabeledPoint(0.0, Vectors.dense(Array(0.0))),
    +        new LabeledPoint(1.0, Vectors.dense(Array(6.0))),
    +        new LabeledPoint(1.0, Vectors.dense(Array(8.0))),
    +        new LabeledPoint(2.0, Vectors.dense(Array(5.0)))
    +      )
    --- End diff --
    
    minor: In Spark, we don't make a new line for the ending `)`. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23887521
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +import scala.collection.mutable.ArrayBuilder
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter). Must be ordered asc
    + */
    +@Experimental
    +class ChiSqSelectorModel  private[mllib] (indices: Array[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: Vector): Vector = {
    +    compress(vector, indices)
    +  }
    +
    +  /**
    +   * Returns a vector with features filtered.
    +   * Preserves the order of filtered features the same as their indices are stored.
    +   * Might be moved to Vector as .slice
    +   * @param features vector
    +   * @param filterIndices indices of features to filter, must be ordered asc
    +   */
    +  private def compress(features: Vector, filterIndices: Array[Int]): Vector = {
    +    features match {
    +      case SparseVector(size, indices, values) =>
    +        val newSize = filterIndices.length
    +        val newValues = new ArrayBuilder.ofDouble
    +        val newIndices = new ArrayBuilder.ofInt
    +        var i: Int = 0
    +        var j: Int = 0
    +        while(i < indices.length && j < filterIndices.length) {
    --- End diff --
    
    space after `while`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23876732
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,86 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter)
    + */
    +@Experimental
    +class ChiSqSelectorModel(indices: IndexedSeq[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: linalg.Vector): linalg.Vector = {
    +    Compress(vector, indices)
    --- End diff --
    
    The question is that it now lives under `mllib.feature.ChiSqSelector.scala`. We can certainly add this method as `Vector.slice` to `Vectors.scala`. But for this PR, it is fine to make it private. Once we find a home for it, we can expose it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23672442
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,86 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter)
    + */
    +@Experimental
    +class ChiSqSelectorModel(indices: IndexedSeq[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: linalg.Vector): linalg.Vector = {
    +    Compress(vector, indices)
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Creates a ChiSquared feature selector.
    + */
    +@Experimental
    +object ChiSqSelector {
    +
    +  /**
    +   * Returns a ChiSquared feature selector.
    +   *
    +   * @param data data used to compute the Chi Squared statistic.
    +   * @param numTopFeatures number of features that selector will select
    +   *                       (ordered by statistic value descending)
    +   */
    +  def fit(data: RDD[LabeledPoint], numTopFeatures: Int): ChiSqSelectorModel = {
    +    val (_, indices) = Statistics.chiSqTest(data).zipWithIndex.sortBy{ case(res, index) =>
    +      -res.statistic}.take(numTopFeatures).unzip
    +    new ChiSqSelectorModel(indices)
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Filters features in a given vector
    + */
    +@Experimental
    +object Compress {
    +  /**
    +   * Returns a vector with features filtered
    +   * @param features vector
    +   * @param indexes indexes of features to filter
    +   */
    +  def apply(features: Vector, indexes: IndexedSeq[Int]): Vector = {
    +    val (values, _) =
    +      features.toArray.zipWithIndex.filter { case (value, index) =>
    +        indexes.contains(index)}.unzip
    --- End diff --
    
    This is not efficient: 1) `toArray` creates a dense array, 2) `indexes.contains` is O(n)-time. We can handle sparsity in a separate PR. For 2), we can do this
    
    ~~~
    val values = features.toArray
    Vector.dense(indices.map(i => values(i)))
    ~~~
          


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-62650435
  
      [Test build #23232 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23232/consoleFull) for   PR 1484 at commit [`f660728`](https://github.com/apache/spark/commit/f660728e16d5bab3efb74832cc3c0a81726fe563).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-69372316
  
    Rebased to let the test pass


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23883318
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,116 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter)
    + */
    +@Experimental
    +class ChiSqSelectorModel(indices: Array[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: Vector): Vector = {
    +    Compress(vector, indices)
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Creates a ChiSquared feature selector.
    + * @param numTopFeatures number of features that selector will select
    + *                       (ordered by statistic value descending)
    + */
    +@Experimental
    +class ChiSqSelector (val numTopFeatures: Int) {
    +
    +  /**
    +   * Returns a ChiSquared feature selector.
    +   *
    +   * @param data data used to compute the Chi Squared statistic.
    +   */
    +  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
    +    val indices = Statistics.chiSqTest(data)
    +      .zipWithIndex.sortBy { case(res, _) => -res.statistic }
    +      .take(numTopFeatures)
    +      .map{ case(_, indices) => indices }
    +    new ChiSqSelectorModel(indices)
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Filters features in a given vector
    + */
    +@Experimental
    +object Compress {
    +  /**
    +   * Returns a vector with features filtered.
    +   * Preserves the order of filtered features the same as their indices are stored.
    +   * @param features vector
    +   * @param filterIndices indices of features to filter
    +   */
    +  def apply(features: Vector, filterIndices: Array[Int]): Vector = {
    +    features match {
    +      case SparseVector(size, indices, values) =>
    +        val filterMap = filterIndices.zipWithIndex.toMap
    --- End diff --
    
    Thank you for suggestion! Do you think it is ok to require indices to be sorted? I've put `.sorted` into `.fit` however another candidate is `.compress`. No "sorted" requirement is needed for the latter.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23887529
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala ---
    @@ -0,0 +1,69 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.scalatest.FunSuite
    +
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +
    +class ChiSqSelectorSuite extends FunSuite with MLlibTestSparkContext {
    +
    +  /*
    +   *  Contingency tables
    +   *  feature0 = {8.0, 0.0}
    +   *  class  0 1 2
    +   *    8.0||1|0|1|
    +   *    0.0||0|2|0|
    +   *
    +   *  feature1 = {7.0, 9.0}
    +   *  class  0 1 2
    +   *    7.0||1|0|0|
    +   *    9.0||0|2|1|
    +   *
    +   *  feature2 = {0.0, 6.0, 8.0, 5.0}
    +   *  class  0 1 2
    +   *    0.0||1|0|0|
    +   *    6.0||0|1|0|
    +   *    8.0||0|1|0|
    +   *    5.0||0|0|1|
    +   *
    +   *  Use chi-squared calculator from Internet
    +   */
    +
    +  test("ChiSqSelector transform test (sparse & dense vector)") {
    +    val labeledDiscreteData = sc.parallelize(
    +      Seq(new LabeledPoint(0.0, Vectors.sparse(3, Array((0, 8.0), (1, 7.0)))),
    --- End diff --
    
    Remove `new`. `LabeledPoint` is a case class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-51033011
  
    @mengxr 
    1.  Do I understand correct, that you propose that `fit(dataset: RDD[LabeledPoint])` should compute feature scores according to the feature selection algorithm and `transform(dataset: RDD[LabeledPoint])` should return the filtered dataset?
    2.  It seems that such an interface allows misuse when someone calls `transform` before `fit`. In some sense it is similar to calling `predict` before actually learning the model. This is avoided in MLLib classification models implementation by means of `ClassificationModel` interface that has `predict` only. Individual classifier has object that returns its instance (that does training as well). I like this approach more because it is less error-prone from user prospective, but it is a little bit implicit from developer's prospective (you need to know that you need to implement a fabric). Long story short, why not to seal `fit` inside the constructor or inside the object?
    ```
    trait FeatureSelector extends Serializable {
       def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
    }
    //EITHER
    class ChiSquaredFeatureSelector(dataset: RDD[LabeledPoint], numFeatures: Int) extends FeatureSelector {
      // perform chi squared computations...
      // implement transform
       override def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
    }
    // OR (like in classification models):
    class ChiSquaredFeatureSelector extends FeatureSelector {
       private def fit(dataset: RDD[LabeledPoint])
      // implement transform
       override def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
    }
    object ChiSquaredFeatureSelector{
       def fit(dataset: RDD[LabeledPoint], numFeatures: Int) {
          val chi = new ChiSquaredFeatureSelector 
          chi.fit
          return chi
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-65732217
  
    @mengxr Could you suggest why the test fails?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-51441325
  
    @mengxr 
    1)  Looks good, probably I should implement `LabeledPointTransformer` same as `VectorTransformer` and then implement `ChiSquared` that returns `ChiSquaredModel` after calling fit. The latter with extend both transformers. One thing only - I need to use a different method name for transform in `LabeledTransformer`.
    2)3)  Ok! 
    4)   I will test discretization in a few days as well. I need it for the project I am working on.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-62658046
  
      [Test build #23232 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23232/consoleFull) for   PR 1484 at commit [`f660728`](https://github.com/apache/spark/commit/f660728e16d5bab3efb74832cc3c0a81726fe563).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ChiSquaredFeatureSelection(labeledDiscreteData: RDD[LabeledPoint], numTopFeatures: Int)`
      * `trait FeatureSelection[T] extends java.io.Serializable `
      * `sealed trait FeatureFilter[T] extends FeatureSelection[T] `
      * `trait LabeledPointFeatureFilter extends FeatureFilter[LabeledPoint] `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72528173
  
      [Test build #26524 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26524/consoleFull) for   PR 1484 at commit [`755d358`](https://github.com/apache/spark/commit/755d358cece5fa09b912416e60ebd7c16f9be05b).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ChiSqSelectorModel (val selectedFeatures: Array[Int]) extends VectorTransformer `
      * `class ChiSqSelector (val numTopFeatures: Int) `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72231302
  
    @mengxr I'll do the updates today


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72277684
  
      [Test build #26424 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26424/consoleFull) for   PR 1484 at commit [`714b878`](https://github.com/apache/spark/commit/714b878037b70ae1d19d2fac72416a8382c77f1e).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ChiSqSelectorModel(indices: Array[Int]) extends VectorTransformer `
      * `class ChiSqSelector (val numTopFeatures: Int) `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23887527
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +import scala.collection.mutable.ArrayBuilder
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter). Must be ordered asc
    + */
    +@Experimental
    +class ChiSqSelectorModel  private[mllib] (indices: Array[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: Vector): Vector = {
    +    compress(vector, indices)
    +  }
    +
    +  /**
    +   * Returns a vector with features filtered.
    +   * Preserves the order of filtered features the same as their indices are stored.
    +   * Might be moved to Vector as .slice
    +   * @param features vector
    +   * @param filterIndices indices of features to filter, must be ordered asc
    +   */
    +  private def compress(features: Vector, filterIndices: Array[Int]): Vector = {
    +    features match {
    +      case SparseVector(size, indices, values) =>
    +        val newSize = filterIndices.length
    +        val newValues = new ArrayBuilder.ofDouble
    +        val newIndices = new ArrayBuilder.ofInt
    +        var i: Int = 0
    +        var j: Int = 0
    +        while(i < indices.length && j < filterIndices.length) {
    +          if(indices(i) == filterIndices(j)) {
    +            newIndices += j
    +            newValues += values(i)
    +            j += 1
    +            i += 1
    +          } else {
    +            if(indices(i) > filterIndices(j)) {
    +              j += 1
    +            } else {
    +              i += 1
    +            }
    +          }
    +        }
    +        /** Sparse representation might be ineffective if (newSize ~= newValues.size) */
    +        Vectors.sparse(newSize, newIndices.result(), newValues.result())
    +      case DenseVector(values) =>
    +        val values = features.toArray
    +        Vectors.dense(filterIndices.map(i => values(i)))
    +      case other =>
    +        throw new UnsupportedOperationException(
    +          s"Only sparse and dense vectors are supported but got ${other.getClass}.")
    +    }
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Creates a ChiSquared feature selector.
    + * @param numTopFeatures number of features that selector will select
    + *                       (ordered by statistic value descending)
    + */
    +@Experimental
    +class ChiSqSelector (val numTopFeatures: Int) {
    +
    +  /**
    +   * Returns a ChiSquared feature selector.
    +   *
    +   * @param data data used to compute the Chi Squared statistic.
    +   */
    +  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
    +    val indices = Statistics.chiSqTest(data)
    +      .zipWithIndex.sortBy { case(res, _) => -res.statistic }
    --- End diff --
    
    space after `case`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23672433
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,86 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg
    --- End diff --
    
    It should be okay to remove this line and use `Vector` directly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72277691
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26424/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23887522
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +import scala.collection.mutable.ArrayBuilder
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter). Must be ordered asc
    + */
    +@Experimental
    +class ChiSqSelectorModel  private[mllib] (indices: Array[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: Vector): Vector = {
    +    compress(vector, indices)
    +  }
    +
    +  /**
    +   * Returns a vector with features filtered.
    +   * Preserves the order of filtered features the same as their indices are stored.
    +   * Might be moved to Vector as .slice
    +   * @param features vector
    +   * @param filterIndices indices of features to filter, must be ordered asc
    +   */
    +  private def compress(features: Vector, filterIndices: Array[Int]): Vector = {
    +    features match {
    +      case SparseVector(size, indices, values) =>
    +        val newSize = filterIndices.length
    +        val newValues = new ArrayBuilder.ofDouble
    +        val newIndices = new ArrayBuilder.ofInt
    +        var i: Int = 0
    +        var j: Int = 0
    +        while(i < indices.length && j < filterIndices.length) {
    +          if(indices(i) == filterIndices(j)) {
    --- End diff --
    
    space after `if`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23672454
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala ---
    @@ -0,0 +1,68 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.scalatest.FunSuite
    +
    +class ChiSqSelectorSuite extends FunSuite with MLlibTestSparkContext {
    +
    +  lazy val labeledDiscreteData = sc.parallelize(
    +    Seq( new LabeledPoint(0.0, Vectors.dense(Array(8.0, 7.0, 0.0))),
    +      new LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 6.0))),
    +      new LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 8.0))),
    +      new LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0)))
    +    ), 2)
    +
    +  /*
    +   *  Contingency tables
    +   *  feature0 = {8.0, 0.0}
    +   *  class  0 1 2
    +   *    8.0||1|0|1|
    +   *    0.0||0|2|0|
    +   *
    +   *  feature1 = {7.0, 9.0}
    +   *  class  0 1 2
    +   *    7.0||1|0|0|
    +   *    9.0||0|2|1|
    +   *
    +   *  feature2 = {0.0, 6.0, 8.0, 5.0}
    +   *  class  0 1 2
    +   *    0.0||1|0|0|
    +   *    6.0||0|1|0|
    +   *    8.0||0|1|0|
    +   *    5.0||0|0|1|
    +   *
    +   *  Use chi-squared calculator from Internet
    +   */
    +
    +  test("ChiSqSelector transform test") {
    +    val preFilteredData =
    +      Set( new LabeledPoint(0.0, Vectors.dense(Array(0.0))),
    +        new LabeledPoint(1.0, Vectors.dense(Array(6.0))),
    +        new LabeledPoint(1.0, Vectors.dense(Array(8.0))),
    +        new LabeledPoint(2.0, Vectors.dense(Array(5.0)))
    +      )
    +    val model = ChiSqSelector.fit(labeledDiscreteData, 1)
    +    val filteredData = labeledDiscreteData.map(lp =>
    +      new LabeledPoint(lp.label, model.transform(lp.features))).collect().toSet
    --- End diff --
    
    minor: If we cannot fit into a single line, it is common in Spark to use the following style:
    
    ~~~
        val filteredData = labeledDiscreteData.map { lp =>
          new LabeledPoint(lp.label, model.transform(lp.features))
        }.collect().toSet
    ~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-49421587
  
    QA results for PR 1484:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>class ChiSquaredFeatureSelection(labeledDiscreteData: RDD[LabeledPoint], numTopFeatures: Int)<br>trait FeatureSelection[T] extends java.io.Serializable {<br>sealed trait FeatureFilter[T] extends FeatureSelection[T] {<br>trait LabeledPointFeatureFilter extends FeatureFilter[LabeledPoint] {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16826/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72298218
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26443/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-62658049
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23232/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72138900
  
    @avulanov Just want to check with you whether you have time to update this PR before the feature freeze deadline (Feb 1st). Btw, the original JIRA was created for general feature selection. I made a new one specifically for chi-square selection (SPARK-5491). Could you update the PR title? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72515240
  
      [Test build #26524 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26524/consoleFull) for   PR 1484 at commit [`755d358`](https://github.com/apache/spark/commit/755d358cece5fa09b912416e60ebd7c16f9be05b).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-51437331
  
    @mengxr Btw., discretization is needed for feature selection. Do you plan to merge this https://issues.apache.org/jira/browse/SPARK-1303 ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72528593
  
    Yes, it would be nice to add feature discretization to MLlib. We had a PP, but as you've tried it doesn't scale well. I don't have concrete scalable algorithms in mind now. We can discuss more on the JIRA page.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72302953
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26451/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72267073
  
      [Test build #26424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26424/consoleFull) for   PR 1484 at commit [`714b878`](https://github.com/apache/spark/commit/714b878037b70ae1d19d2fac72416a8382c77f1e).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-69383209
  
      [Test build #25329 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25329/consoleFull) for   PR 1484 at commit [`57cbb48`](https://github.com/apache/spark/commit/57cbb489a3651e2f3a17abf5e57dbb05650e6af7).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-62820205
  
    No, `Statistics.chiSqTest` is public. Please check the link I mentioned above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-69123045
  
      [Test build #559 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/559/consoleFull) for   PR 1484 at commit [`e972e07`](https://github.com/apache/spark/commit/e972e076ce1f96d9d437b6c759400c9f176f8d56).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-62982452
  
    @mengxr for some reason I cannot see the trace of the build, it seems that I need to login to Jenkins, but I don't have an account there


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-62818823
  
    @mengxr `ChiSqTest` is private for `stat` package, I cannot access it from `feature` package: `private[stat] object ChiSqTest`. Should I change `[stat]` to `[mllib]` or should I make it public?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72298443
  
    Test failed due to unrelated exception


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23672438
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,86 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter)
    + */
    +@Experimental
    +class ChiSqSelectorModel(indices: IndexedSeq[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: linalg.Vector): linalg.Vector = {
    +    Compress(vector, indices)
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Creates a ChiSquared feature selector.
    + */
    +@Experimental
    +object ChiSqSelector {
    --- End diff --
    
    Make `ChiSqSelector` a class and `numTopFeatures` its parameter. This should be quite similar to, e.g., IDF: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala#L42


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23887528
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +import scala.collection.mutable.ArrayBuilder
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter). Must be ordered asc
    + */
    +@Experimental
    +class ChiSqSelectorModel  private[mllib] (indices: Array[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: Vector): Vector = {
    +    compress(vector, indices)
    +  }
    +
    +  /**
    +   * Returns a vector with features filtered.
    +   * Preserves the order of filtered features the same as their indices are stored.
    +   * Might be moved to Vector as .slice
    +   * @param features vector
    +   * @param filterIndices indices of features to filter, must be ordered asc
    +   */
    +  private def compress(features: Vector, filterIndices: Array[Int]): Vector = {
    +    features match {
    +      case SparseVector(size, indices, values) =>
    +        val newSize = filterIndices.length
    +        val newValues = new ArrayBuilder.ofDouble
    +        val newIndices = new ArrayBuilder.ofInt
    +        var i: Int = 0
    +        var j: Int = 0
    +        while(i < indices.length && j < filterIndices.length) {
    +          if(indices(i) == filterIndices(j)) {
    +            newIndices += j
    +            newValues += values(i)
    +            j += 1
    +            i += 1
    +          } else {
    +            if(indices(i) > filterIndices(j)) {
    +              j += 1
    +            } else {
    +              i += 1
    +            }
    +          }
    +        }
    +        /** Sparse representation might be ineffective if (newSize ~= newValues.size) */
    +        Vectors.sparse(newSize, newIndices.result(), newValues.result())
    +      case DenseVector(values) =>
    +        val values = features.toArray
    +        Vectors.dense(filterIndices.map(i => values(i)))
    +      case other =>
    +        throw new UnsupportedOperationException(
    +          s"Only sparse and dense vectors are supported but got ${other.getClass}.")
    +    }
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Creates a ChiSquared feature selector.
    + * @param numTopFeatures number of features that selector will select
    + *                       (ordered by statistic value descending)
    + */
    +@Experimental
    +class ChiSqSelector (val numTopFeatures: Int) {
    +
    +  /**
    +   * Returns a ChiSquared feature selector.
    +   *
    +   * @param data data used to compute the Chi Squared statistic.
    +   */
    +  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
    +    val indices = Statistics.chiSqTest(data)
    +      .zipWithIndex.sortBy { case(res, _) => -res.statistic }
    +      .take(numTopFeatures)
    +      .map{ case(_, indices) => indices }
    --- End diff --
    
    space before `{` and after `case`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23672435
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,86 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter)
    + */
    +@Experimental
    +class ChiSqSelectorModel(indices: IndexedSeq[Int]) extends VectorTransformer {
    --- End diff --
    
    Maybe it is better to use `Array[Int]`, which is Java compatible.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-51168631
  
    @mengxr 
    1)  I also have concerns regarding the mentioned two options. Throwing an error means to have a method that returns an error when it is called with valid parameters. Calling `fit` inside `transform` will cause a question what the next `fit` call will do. 
    2)  Could you explain how the upper bound like `[T <: Vectorized with Labeled]` can be implemented? `LabeledPoint` is a case class with no class hierarchy or traits. 
    3)  It seems that all implementations of transform will do the same: filter features by index. I propose to implement such a filter. It also will solve the problem of filtering both `LabeledPoint` and `Vector`:
    ```
    trait FeatureFilter {
       val indices: Set[Int]
       def transform(RDD[LabeledPoint]: data) = data.map { lp => new LabeledPoint(lp.label, Compress(lp.features, indices)) }
       def transform(RDD[Vector]: data) = data.map { v => Compress(v, indices) }
    }
    
    object Compress {
     def apply(features: Vector, indexes: Set[Int]): Vector = {
        val (values, _) =
          features.toArray.zipWithIndex.filter { case (value, index) =>
            indexes.contains(index)}.unzip
        Vectors.dense(values.toArray)
      }
    }
    
    class ChiSquaredFeatureSelection(RDD[LabeledPoint]: data, Int: numFeatures) extends FeatureFilter {
       // compute chiSquared and return feature indices 
       featureIndices = {....}
    }
    
    
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-51023586
  
    @mengxr Could you review or comment this? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-58975422
  
    @mengxr Sure! Thanks for suggestion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23672452
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala ---
    @@ -0,0 +1,68 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.scalatest.FunSuite
    +
    +class ChiSqSelectorSuite extends FunSuite with MLlibTestSparkContext {
    +
    +  lazy val labeledDiscreteData = sc.parallelize(
    +    Seq( new LabeledPoint(0.0, Vectors.dense(Array(8.0, 7.0, 0.0))),
    +      new LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 6.0))),
    +      new LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 8.0))),
    +      new LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0)))
    +    ), 2)
    +
    +  /*
    +   *  Contingency tables
    +   *  feature0 = {8.0, 0.0}
    +   *  class  0 1 2
    +   *    8.0||1|0|1|
    +   *    0.0||0|2|0|
    +   *
    +   *  feature1 = {7.0, 9.0}
    +   *  class  0 1 2
    +   *    7.0||1|0|0|
    +   *    9.0||0|2|1|
    +   *
    +   *  feature2 = {0.0, 6.0, 8.0, 5.0}
    +   *  class  0 1 2
    +   *    0.0||1|0|0|
    +   *    6.0||0|1|0|
    +   *    8.0||0|1|0|
    +   *    5.0||0|0|1|
    +   *
    +   *  Use chi-squared calculator from Internet
    +   */
    +
    +  test("ChiSqSelector transform test") {
    +    val preFilteredData =
    +      Set( new LabeledPoint(0.0, Vectors.dense(Array(0.0))),
    --- End diff --
    
    Remove space after `(`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72528802
  
    Merged into master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72302951
  
      [Test build #26451 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26451/consoleFull) for   PR 1484 at commit [`a6ad82a`](https://github.com/apache/spark/commit/a6ad82af0d81cfcba022fcaa1831153a1f2c36d9).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ChiSqSelector (val numTopFeatures: Int) `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-62821238
  
    Ok, thanks! Sorry, I didn't understand the API from the first sight :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-49421581
  
    QA tests have started for PR 1484. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16826/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-62677686
  
    @avulanov We have ChiSq tests implemented under "mllib.stat.Statistics":
    
    https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala#L166
    
    Could you please call the method there and select top features based on the test statistics? This would make us have a simple place for ChiSq implementation.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23876952
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,86 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter)
    + */
    +@Experimental
    +class ChiSqSelectorModel(indices: IndexedSeq[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: linalg.Vector): linalg.Vector = {
    +    Compress(vector, indices)
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Creates a ChiSquared feature selector.
    + */
    +@Experimental
    +object ChiSqSelector {
    --- End diff --
    
    For this one, the static method is convenient. But others may take many parameters, where a static method becomes hard to maintain. To be consistent across MLlib, we use this Estimator/Model style. We can also embed statistics in the model later, which is also helpful.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23861260
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,86 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter)
    + */
    +@Experimental
    +class ChiSqSelectorModel(indices: IndexedSeq[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: linalg.Vector): linalg.Vector = {
    +    Compress(vector, indices)
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Creates a ChiSquared feature selector.
    + */
    +@Experimental
    +object ChiSqSelector {
    --- End diff --
    
    Done! However, why do you think it is better than having static function given that this class does nothing but storing an integer (same for IDF)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72528154
  
    LGTM pending Jenkins ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23887524
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +import scala.collection.mutable.ArrayBuilder
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter). Must be ordered asc
    + */
    +@Experimental
    +class ChiSqSelectorModel  private[mllib] (indices: Array[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: Vector): Vector = {
    +    compress(vector, indices)
    +  }
    +
    +  /**
    +   * Returns a vector with features filtered.
    +   * Preserves the order of filtered features the same as their indices are stored.
    +   * Might be moved to Vector as .slice
    +   * @param features vector
    +   * @param filterIndices indices of features to filter, must be ordered asc
    +   */
    +  private def compress(features: Vector, filterIndices: Array[Int]): Vector = {
    +    features match {
    +      case SparseVector(size, indices, values) =>
    +        val newSize = filterIndices.length
    +        val newValues = new ArrayBuilder.ofDouble
    +        val newIndices = new ArrayBuilder.ofInt
    +        var i: Int = 0
    +        var j: Int = 0
    +        while(i < indices.length && j < filterIndices.length) {
    +          if(indices(i) == filterIndices(j)) {
    +            newIndices += j
    +            newValues += values(i)
    +            j += 1
    +            i += 1
    +          } else {
    +            if(indices(i) > filterIndices(j)) {
    +              j += 1
    +            } else {
    +              i += 1
    +            }
    +          }
    +        }
    +        /** Sparse representation might be ineffective if (newSize ~= newValues.size) */
    --- End diff --
    
    Good point! Use a normal comment style `// TODO: Sparse ...`. `/** ... */` is for JavaDoc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23877192
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,116 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter)
    + */
    +@Experimental
    +class ChiSqSelectorModel(indices: Array[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: Vector): Vector = {
    +    Compress(vector, indices)
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Creates a ChiSquared feature selector.
    + * @param numTopFeatures number of features that selector will select
    + *                       (ordered by statistic value descending)
    + */
    +@Experimental
    +class ChiSqSelector (val numTopFeatures: Int) {
    +
    +  /**
    +   * Returns a ChiSquared feature selector.
    +   *
    +   * @param data data used to compute the Chi Squared statistic.
    +   */
    +  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
    +    val indices = Statistics.chiSqTest(data)
    +      .zipWithIndex.sortBy { case(res, _) => -res.statistic }
    +      .take(numTopFeatures)
    +      .map{ case(_, indices) => indices }
    +    new ChiSqSelectorModel(indices)
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Filters features in a given vector
    + */
    +@Experimental
    +object Compress {
    +  /**
    +   * Returns a vector with features filtered.
    +   * Preserves the order of filtered features the same as their indices are stored.
    +   * @param features vector
    +   * @param filterIndices indices of features to filter
    +   */
    +  def apply(features: Vector, filterIndices: Array[Int]): Vector = {
    +    features match {
    +      case SparseVector(size, indices, values) =>
    +        val filterMap = filterIndices.zipWithIndex.toMap
    --- End diff --
    
    This is slow due to hash map creation and hash lookups. Since both arrays are order, we can use the one-catch-another approach to extract indices, for example,
    
    https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala#L344
    
    Btw, please use `ArrayBuilder` to build new index/value arrays, which doesn't have the boxing/unboxing issues.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-49429734
  
    QA results for PR 1484:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>class ChiSquaredFeatureSelection(labeledDiscreteData: RDD[LabeledPoint], numTopFeatures: Int)<br>trait FeatureSelection[T] extends java.io.Serializable {<br>sealed trait FeatureFilter[T] extends FeatureSelection[T] {<br>trait LabeledPointFeatureFilter extends FeatureFilter[LabeledPoint] {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16827/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-62986690
  
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23329/console
    
    I saw `sbt/sbt mllib/test:compie` failed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23887517
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +import scala.collection.mutable.ArrayBuilder
    --- End diff --
    
    organize imports (If you use idea intellij, there is a useful plugin: https://plugins.jetbrains.com/plugin/7350)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-51438840
  
    1) https://github.com/apache/spark/pull/1814 separates model from algorithm.
    2) It is just an idea such that a feature transformer doesn't need to worry about other data in the instance.
    3) It won't work because both transform methods have the same type signature. This is also related to 2). We want to apply the filter on the vector and we don't care other data, labeled or not. I'm now working on the API design to make feature transformation easier. Let's discuss more after v1.1.
    
    Btw, I will re-visit the discretization PR after v1.1 to make sure it doesn't have performance issues.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-56999682
  
    @avulanov In 1.1, we have `chiSqTest` implemented in `mllib.stat.Statistics`. Could you update this PR using `chiSqTest`, implement `ChiSqSelector` under `mllib.feature`, similar to `StandardScaler` and `IDF`? Thanks!
    
    For the transformer name, `chiSquared` is a heavily overloaded term. So I suggest `ChiSqSelector` to be more precise. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-62656191
  
    @mengxr Just to clarify: I'll implement `ChiSqSelector` with the method `fit(data: RDD[LabeledPoint]): ChiSqSelectorModel`. The latter `extends VectorTransformer`. Right? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-62982138
  
      [Test build #23329 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23329/consoleFull) for   PR 1484 at commit [`e972e07`](https://github.com/apache/spark/commit/e972e076ce1f96d9d437b6c759400c9f176f8d56).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ChiSqSelectorModel(indices: IndexedSeq[Int]) extends VectorTransformer `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72298217
  
      [Test build #26443 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26443/consoleFull) for   PR 1484 at commit [`a6ad82a`](https://github.com/apache/spark/commit/a6ad82af0d81cfcba022fcaa1831153a1f2c36d9).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ChiSqSelector (val numTopFeatures: Int) `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23887520
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +import scala.collection.mutable.ArrayBuilder
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter). Must be ordered asc
    + */
    +@Experimental
    +class ChiSqSelectorModel  private[mllib] (indices: Array[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: Vector): Vector = {
    +    compress(vector, indices)
    +  }
    +
    +  /**
    +   * Returns a vector with features filtered.
    +   * Preserves the order of filtered features the same as their indices are stored.
    +   * Might be moved to Vector as .slice
    +   * @param features vector
    +   * @param filterIndices indices of features to filter, must be ordered asc
    +   */
    +  private def compress(features: Vector, filterIndices: Array[Int]): Vector = {
    +    features match {
    +      case SparseVector(size, indices, values) =>
    +        val newSize = filterIndices.length
    +        val newValues = new ArrayBuilder.ofDouble
    +        val newIndices = new ArrayBuilder.ofInt
    +        var i: Int = 0
    --- End diff --
    
    remove `: Int` (because this is a local var and it is clearly an `Int`)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23887545
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,116 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter)
    + */
    +@Experimental
    +class ChiSqSelectorModel(indices: Array[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: Vector): Vector = {
    +    Compress(vector, indices)
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Creates a ChiSquared feature selector.
    + * @param numTopFeatures number of features that selector will select
    + *                       (ordered by statistic value descending)
    + */
    +@Experimental
    +class ChiSqSelector (val numTopFeatures: Int) {
    +
    +  /**
    +   * Returns a ChiSquared feature selector.
    +   *
    +   * @param data data used to compute the Chi Squared statistic.
    +   */
    +  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
    +    val indices = Statistics.chiSqTest(data)
    +      .zipWithIndex.sortBy { case(res, _) => -res.statistic }
    +      .take(numTopFeatures)
    +      .map{ case(_, indices) => indices }
    +    new ChiSqSelectorModel(indices)
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Filters features in a given vector
    + */
    +@Experimental
    +object Compress {
    +  /**
    +   * Returns a vector with features filtered.
    +   * Preserves the order of filtered features the same as their indices are stored.
    +   * @param features vector
    +   * @param filterIndices indices of features to filter
    +   */
    +  def apply(features: Vector, filterIndices: Array[Int]): Vector = {
    +    features match {
    +      case SparseVector(size, indices, values) =>
    +        val filterMap = filterIndices.zipWithIndex.toMap
    --- End diff --
    
    Please see my new inline comments about the constructor. Basically, if we expose the constructor, we should check the ordering of `indices`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23672451
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala ---
    @@ -0,0 +1,68 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.scalatest.FunSuite
    +
    +class ChiSqSelectorSuite extends FunSuite with MLlibTestSparkContext {
    +
    +  lazy val labeledDiscreteData = sc.parallelize(
    +    Seq( new LabeledPoint(0.0, Vectors.dense(Array(8.0, 7.0, 0.0))),
    --- End diff --
    
    Remove space after `(`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23672440
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,86 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter)
    + */
    +@Experimental
    +class ChiSqSelectorModel(indices: IndexedSeq[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: linalg.Vector): linalg.Vector = {
    +    Compress(vector, indices)
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Creates a ChiSquared feature selector.
    + */
    +@Experimental
    +object ChiSqSelector {
    +
    +  /**
    +   * Returns a ChiSquared feature selector.
    +   *
    +   * @param data data used to compute the Chi Squared statistic.
    +   * @param numTopFeatures number of features that selector will select
    +   *                       (ordered by statistic value descending)
    +   */
    +  def fit(data: RDD[LabeledPoint], numTopFeatures: Int): ChiSqSelectorModel = {
    +    val (_, indices) = Statistics.chiSqTest(data).zipWithIndex.sortBy{ case(res, index) =>
    --- End diff --
    
    It might be easier to read if we use multiple lines:
    
    ~~~
    val indices = Statistics.chiSqTest(data)
      .zipWithIndex
      .sortBy(-_._1.statistic)
      .take(numTopFeatures)
      .map(_._2)
      .toArray
    ~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-62982146
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23329/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23672450
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala ---
    @@ -0,0 +1,68 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.scalatest.FunSuite
    +
    +class ChiSqSelectorSuite extends FunSuite with MLlibTestSparkContext {
    +
    +  lazy val labeledDiscreteData = sc.parallelize(
    --- End diff --
    
    This may be risky because `sc` is initialized in `beforeAll`. We can either move it to the test closure or initialize it in `beforeAll`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-49421907
  
    QA tests have started for PR 1484. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16827/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1484


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23887518
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +import scala.collection.mutable.ArrayBuilder
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter). Must be ordered asc
    + */
    +@Experimental
    +class ChiSqSelectorModel  private[mllib] (indices: Array[Int]) extends VectorTransformer {
    --- End diff --
    
    Remove one space before `private`. It should be okay to expose the constructor as well as `indices`. Otherwise, user won't be able to save/load a model and there is no easy way to obtain `indices`. We need to check the ordering of `indices` in the constructor then. For the name, is it clear that `indices` means selected feature indices under this context? `selectedFeatures` may sound better, which we can put under a feature selector trait. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-69126401
  
      [Test build #559 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/559/consoleFull) for   PR 1484 at commit [`e972e07`](https://github.com/apache/spark/commit/e972e076ce1f96d9d437b6c759400c9f176f8d56).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ChiSqSelectorModel(indices: IndexedSeq[Int]) extends VectorTransformer `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by avulanov <gi...@git.apache.org>.

Github user avulanov commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23870580
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,86 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg
    +import org.apache.spark.mllib.linalg.{Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter)
    + */
    +@Experimental
    +class ChiSqSelectorModel(indices: IndexedSeq[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: linalg.Vector): linalg.Vector = {
    +    Compress(vector, indices)
    --- End diff --
    
    I though it would be useful in general for filtering features. Does it make sense?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23672445
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala ---
    @@ -0,0 +1,68 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.scalatest.FunSuite
    --- End diff --
    
    organize imports into groups: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72300633
  
    test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-72294707
  
      [Test build #26443 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26443/consoleFull) for   PR 1484 at commit [`a6ad82a`](https://github.com/apache/spark/commit/a6ad82af0d81cfcba022fcaa1831153a1f2c36d9).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1484#discussion_r23887523
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors, Vector}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.Statistics
    +import org.apache.spark.rdd.RDD
    +
    +import scala.collection.mutable.ArrayBuilder
    +
    +/**
    + * :: Experimental ::
    + * Chi Squared selector model.
    + *
    + * @param indices list of indices to select (filter). Must be ordered asc
    + */
    +@Experimental
    +class ChiSqSelectorModel  private[mllib] (indices: Array[Int]) extends VectorTransformer {
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  override def transform(vector: Vector): Vector = {
    +    compress(vector, indices)
    +  }
    +
    +  /**
    +   * Returns a vector with features filtered.
    +   * Preserves the order of filtered features the same as their indices are stored.
    +   * Might be moved to Vector as .slice
    +   * @param features vector
    +   * @param filterIndices indices of features to filter, must be ordered asc
    +   */
    +  private def compress(features: Vector, filterIndices: Array[Int]): Vector = {
    +    features match {
    +      case SparseVector(size, indices, values) =>
    +        val newSize = filterIndices.length
    +        val newValues = new ArrayBuilder.ofDouble
    +        val newIndices = new ArrayBuilder.ofInt
    +        var i: Int = 0
    +        var j: Int = 0
    +        while(i < indices.length && j < filterIndices.length) {
    +          if(indices(i) == filterIndices(j)) {
    +            newIndices += j
    +            newValues += values(i)
    +            j += 1
    +            i += 1
    +          } else {
    +            if(indices(i) > filterIndices(j)) {
    --- End diff --
    
    space after `if`. For performance, we can store `indices(i)` and `filterIndices(j)` to avoid look up twice


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org