Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2017/03/01 02:36:13 UTC

[GitHub] spark pull request #17110: [SPARK-19635][ML] DataFrame-based API for chi squ...

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/17110

    [SPARK-19635][ML] DataFrame-based API for chi square test

    ## What changes were proposed in this pull request?
    
    Wrapper taking and returning a DataFrame
    
    ## How was this patch tested?
    
    Copied unit tests from RDD-based API

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark df-hypotests

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17110.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17110
    
----
commit a9a8225162da4064714393d24ea601f1cd42753a
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2017-03-01T02:34:59Z

    DF-based api for chi square test

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/17110
  
    My only concern is that the testing is a bit sparse, but more tests can be added in the future (especially when the MLlib part is removed).  It seems it would be better to move more tests from ML -> MLlib over time.



[GitHub] spark pull request #17110: [SPARK-19635][ML] DataFrame-based API for chi squ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17110



[GitHub] spark pull request #17110: [SPARK-19635][ML] DataFrame-based API for chi squ...

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17110#discussion_r103813169
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/ChiSquare.scala ---
    @@ -0,0 +1,81 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.stat
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.util.SchemaUtils
    +import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
    +import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}
    +import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions.col
    +
    +
    +/**
    + * :: Experimental ::
    + *
    + * Chi-square hypothesis testing for categorical data.
    + *
    + * See <a href="http://en.wikipedia.org/wiki/Chi-squared_test">Wikipedia</a> for more information
    + * on the Chi-squared test.
    + */
    +@Experimental
    +@Since("2.2.0")
    +object ChiSquare {
    +
    +  /** Used to construct output schema of tests */
    +  private case class ChiSquareResult(
    +      pValues: Vector,
    +      degreesOfFreedom: Array[Int],
    +      statistics: Vector)
    +
    +  /**
    +   * Conduct Pearson's independence test for every feature against the label across the input RDD.
    +   * For each feature, the (feature, label) pairs are converted into a contingency matrix for which
    +   * the Chi-squared statistic is computed. All label and feature values must be categorical.
    +   *
    +   * The null hypothesis is that the occurrence of the outcomes is statistically independent.
    +   *
    +   * @param dataset  DataFrame of categorical labels and categorical features.
    +   *                 Real-valued features will be treated as categorical for each distinct value.
    +   * @param featuresCol  Name of features column in dataset, of type `Vector` (`VectorUDT`)
    +   * @param labelCol  Name of label column in dataset, of any numerical type
    +   * @return DataFrame containing the test result for every feature against the label.
    +   *         This DataFrame will contain a single Row with the following fields:
    +   *          - `pValues: Vector`
    +   *          - `degreesOfFreedom: Array[Int]`
    +   *          - `statistics: Vector`
    +   *         Each of these fields has one value per feature.
    +   */
    +  @Since("2.2.0")
    +  def test(dataset: DataFrame, featuresCol: String, labelCol: String): DataFrame = {
    +    val spark = dataset.sparkSession
    +    import spark.implicits._
    +
    +    SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT)
    +    SchemaUtils.checkNumericType(dataset.schema, labelCol)
    +    val rdd = dataset.select(col(labelCol).cast("double"), col(featuresCol)).as[(Double, Vector)]
    +      .rdd.map { case (label, features) => OldLabeledPoint(label, OldVectors.fromML(features)) }
    +    val testResults = OldStatistics.chiSqTest(rdd)
    --- End diff --
    
    It would be nice to optimize this in the future: since we have the schema, if the label and features have already been converted to categorical, we can get the unique values right away instead of having to regenerate the maps of distinct labels and features.
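The idea in the comment above can be illustrated with a minimal, Spark-free sketch. It assumes (this is not stated in the thread) that the feature and label values have already been mapped to category indices in `[0, k)` and that the category counts are known up front, for example from column metadata. The object `KnownCategories` and its parameters are hypothetical names used only for illustration; this is not the pull request's implementation.

```scala
// Sketch only: hypothetical names, plain Scala, no Spark dependency.
object KnownCategories {

  /**
   * Builds a numFeatureValues x numLabels contingency matrix in one pass.
   * Assumes every pair holds category indices that are already within range,
   * so no map of distinct values has to be collected first.
   */
  def contingency(
      pairs: Seq[(Int, Int)], // (featureIndex, labelIndex)
      numFeatureValues: Int,
      numLabels: Int): Array[Array[Long]] = {
    val counts = Array.ofDim[Long](numFeatureValues, numLabels)
    pairs.foreach { case (f, l) => counts(f)(l) += 1L }
    counts
  }

  def main(args: Array[String]): Unit = {
    // Three feature categories, two label categories.
    val pairs = Seq((0, 0), (1, 0), (1, 1), (2, 0), (2, 0), (2, 1))
    contingency(pairs, 3, 2).foreach(row => println(row.mkString(" ")))
  }
}
```

With unknown categories, a first pass over the data is needed just to discover the distinct values; with known counts, the matrix can be allocated immediately and filled in a single pass.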



[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17110
  
    OK, merging with master.
    Thanks @imatiach-msft and @thunterdb!
    
    @imatiach-msft I agree about sparse testing.  This has all of the MLlib tests, but we should add more in the future.



[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17110
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73644/



[GitHub] spark pull request #17110: [SPARK-19635][ML] DataFrame-based API for chi squ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17110#discussion_r104220074
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/ChiSquare.scala ---
    @@ -0,0 +1,81 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.stat
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.util.SchemaUtils
    +import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
    +import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}
    +import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions.col
    +
    +
    +/**
    + * :: Experimental ::
    + *
    + * Chi-square hypothesis testing for categorical data.
    + *
    + * See <a href="http://en.wikipedia.org/wiki/Chi-squared_test">Wikipedia</a> for more information
    + * on the Chi-squared test.
    + */
    +@Experimental
    +@Since("2.2.0")
    +object ChiSquare {
    +
    +  /** Used to construct output schema of tests */
    +  private case class ChiSquareResult(
    +      pValues: Vector,
    +      degreesOfFreedom: Array[Int],
    +      statistics: Vector)
    +
    +  /**
    +   * Conduct Pearson's independence test for every feature against the label across the input RDD.
    +   * For each feature, the (feature, label) pairs are converted into a contingency matrix for which
    +   * the Chi-squared statistic is computed. All label and feature values must be categorical.
    +   *
    +   * The null hypothesis is that the occurrence of the outcomes is statistically independent.
    +   *
    +   * @param dataset  DataFrame of categorical labels and categorical features.
    +   *                 Real-valued features will be treated as categorical for each distinct value.
    +   * @param featuresCol  Name of features column in dataset, of type `Vector` (`VectorUDT`)
    +   * @param labelCol  Name of label column in dataset, of any numerical type
    +   * @return DataFrame containing the test result for every feature against the label.
    +   *         This DataFrame will contain a single Row with the following fields:
    +   *          - `pValues: Vector`
    +   *          - `degreesOfFreedom: Array[Int]`
    +   *          - `statistics: Vector`
    +   *         Each of these fields has one value per feature.
    +   */
    +  @Since("2.2.0")
    +  def test(dataset: DataFrame, featuresCol: String, labelCol: String): DataFrame = {
    +    val spark = dataset.sparkSession
    +    import spark.implicits._
    +
    +    SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT)
    +    SchemaUtils.checkNumericType(dataset.schema, labelCol)
    --- End diff --
    
    Sounds reasonable, but let's do that in the future; this is already a lot more types than the RDD-based API supports.



[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17110
  
    **[Test build #73644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73644/testReport)** for PR 17110 at commit [`a9a8225`](https://github.com/apache/spark/commit/a9a8225162da4064714393d24ea601f1cd42753a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17110
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17110
  
    I just reversed my opinion about a shared "Statistics" object.  See https://github.com/apache/spark/pull/17108#issuecomment-285200613 for details.
    
    I pushed an update per your review @imatiach-msft 



[GitHub] spark pull request #17110: [SPARK-19635][ML] DataFrame-based API for chi squ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17110#discussion_r104220081
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/ChiSquare.scala ---
    @@ -0,0 +1,81 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.stat
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.util.SchemaUtils
    +import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
    +import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}
    +import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions.col
    +
    +
    +/**
    + * :: Experimental ::
    + *
    + * Chi-square hypothesis testing for categorical data.
    + *
    + * See <a href="http://en.wikipedia.org/wiki/Chi-squared_test">Wikipedia</a> for more information
    + * on the Chi-squared test.
    + */
    +@Experimental
    +@Since("2.2.0")
    +object ChiSquare {
    +
    +  /** Used to construct output schema of tests */
    +  private case class ChiSquareResult(
    +      pValues: Vector,
    +      degreesOfFreedom: Array[Int],
    +      statistics: Vector)
    +
    +  /**
    +   * Conduct Pearson's independence test for every feature against the label across the input RDD.
    +   * For each feature, the (feature, label) pairs are converted into a contingency matrix for which
    +   * the Chi-squared statistic is computed. All label and feature values must be categorical.
    +   *
    +   * The null hypothesis is that the occurrence of the outcomes is statistically independent.
    +   *
    +   * @param dataset  DataFrame of categorical labels and categorical features.
    +   *                 Real-valued features will be treated as categorical for each distinct value.
    +   * @param featuresCol  Name of features column in dataset, of type `Vector` (`VectorUDT`)
    +   * @param labelCol  Name of label column in dataset, of any numerical type
    +   * @return DataFrame containing the test result for every feature against the label.
    +   *         This DataFrame will contain a single Row with the following fields:
    +   *          - `pValues: Vector`
    +   *          - `degreesOfFreedom: Array[Int]`
    +   *          - `statistics: Vector`
    +   *         Each of these fields has one value per feature.
    +   */
    +  @Since("2.2.0")
    +  def test(dataset: DataFrame, featuresCol: String, labelCol: String): DataFrame = {
    +    val spark = dataset.sparkSession
    +    import spark.implicits._
    +
    +    SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT)
    +    SchemaUtils.checkNumericType(dataset.schema, labelCol)
    +    val rdd = dataset.select(col(labelCol).cast("double"), col(featuresCol)).as[(Double, Vector)]
    +      .rdd.map { case (label, features) => OldLabeledPoint(label, OldVectors.fromML(features)) }
    +    val testResults = OldStatistics.chiSqTest(rdd)
    --- End diff --
    
    Definitely; feel free to make a JIRA for it.



[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17110
  
    **[Test build #74227 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74227/testReport)** for PR 17110 at commit [`19fa02a`](https://github.com/apache/spark/commit/19fa02ad6d8cd73553cc804828e659918c6fa872).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17110
  
    Actually, I synced with @thunterdb and will update the design doc to put everything under a "Statistics" object.  I'll wait until https://github.com/apache/spark/pull/17108 gets merged.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on the issue:

    https://github.com/apache/spark/pull/17110
  
    @jkbradley LGTM, thanks!



[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17110
  
    Ping @imatiach-msft: any more comments after the update?



[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17110
  
    **[Test build #74227 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74227/testReport)** for PR 17110 at commit [`19fa02a`](https://github.com/apache/spark/commit/19fa02ad6d8cd73553cc804828e659918c6fa872).



[GitHub] spark pull request #17110: [SPARK-19635][ML] DataFrame-based API for chi squ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17110#discussion_r104220095
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/stat/ChiSquareSuite.scala ---
    @@ -0,0 +1,94 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.stat
    +
    +import java.util.Random
    +
    +import org.apache.spark.{SparkException, SparkFunSuite}
    +import org.apache.spark.ml.feature.LabeledPoint
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.ml.util.TestingUtils._
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +
    +class ChiSquareSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  test("test DataFrame of labeled points") {
    +    // labels: 1.0 (2 / 6), 0.0 (4 / 6)
    +    // feature1: 0.5 (1 / 6), 1.5 (2 / 6), 3.5 (3 / 6)
    +    // feature2: 10.0 (1 / 6), 20.0 (1 / 6), 30.0 (2 / 6), 40.0 (2 / 6)
    +    val data = Seq(
    +      LabeledPoint(0.0, Vectors.dense(0.5, 10.0)),
    +      LabeledPoint(0.0, Vectors.dense(1.5, 20.0)),
    +      LabeledPoint(1.0, Vectors.dense(1.5, 30.0)),
    +      LabeledPoint(0.0, Vectors.dense(3.5, 30.0)),
    +      LabeledPoint(0.0, Vectors.dense(3.5, 40.0)),
    +      LabeledPoint(1.0, Vectors.dense(3.5, 40.0)))
    +    for (numParts <- List(2, 4, 6, 8)) {
    +      val df = spark.createDataFrame(sc.parallelize(data, numParts))
    +      val chi = ChiSquare.test(df, "features", "label")
    +      val (pValues: Vector, degreesOfFreedom: Array[Int], statistics: Vector) =
    +        chi.select("pValues", "degreesOfFreedom", "statistics")
    +          .as[(Vector, Array[Int], Vector)].head()
    +      assert(pValues ~== Vectors.dense(0.6873, 0.6823) relTol 1e-4)
    +      assert(degreesOfFreedom === Array(2, 3))
    +      assert(statistics ~== Vectors.dense(0.75, 1.5) relTol 1e-4)
    +    }
    +  }
    +
    +  test("large number of features (SPARK-3087)") {
    +    // Test that the right number of results is returned
    +    val numCols = 1001
    +    val sparseData = Array(
    +      LabeledPoint(0.0, Vectors.sparse(numCols, Seq((100, 2.0)))),
    +      LabeledPoint(0.1, Vectors.sparse(numCols, Seq((200, 1.0)))))
    +    val df = spark.createDataFrame(sparseData)
    +    val chi = ChiSquare.test(df, "features", "label")
    +    val (pValues: Vector, degreesOfFreedom: Array[Int], statistics: Vector) =
    +      chi.select("pValues", "degreesOfFreedom", "statistics")
    +        .as[(Vector, Array[Int], Vector)].head()
    +    assert(pValues.size === numCols)
    +    assert(degreesOfFreedom.length === numCols)
    +    assert(statistics.size === numCols)
    +    assert(pValues(1000) !== null)  // SPARK-3087
    +  }
    +
    +  test("fail on continuous features or labels") {
    +    // Detect continuous features or labels
    +    val random = new Random(11L)
    +    val continuousLabel =
    +      Seq.fill(100000)(LabeledPoint(random.nextDouble(), Vectors.dense(random.nextInt(2))))
    --- End diff --
    
    Good idea, done now.
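For readers checking the numbers, the values asserted in the "test DataFrame of labeled points" test quoted above can be reproduced by hand. The following is a minimal sketch in plain Scala with no Spark dependency; `ChiSquareByHand` and its members are hypothetical names for illustration and are not part of the pull request. It computes Pearson's statistic, the sum over contingency cells of (observed - expected)^2 / expected, and the degrees of freedom as (rows - 1) * (columns - 1). The p-value helper covers only the 2-degrees-of-freedom case, where the chi-squared survival function has the closed form exp(-x / 2).

```scala
// Sketch only: hypothetical names, plain Scala, no Spark dependency.
object ChiSquareByHand {

  /** Pearson's statistic over (featureValue, label) pairs. */
  def statistic(pairs: Seq[(Double, Double)]): Double = {
    val n = pairs.size.toDouble
    // Observed count for each (featureValue, label) cell.
    val observed = pairs.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
    // Marginal totals per feature value (rows) and per label (columns).
    val rowTotals = pairs.groupBy(_._1).map { case (f, v) => f -> v.size.toDouble }
    val colTotals = pairs.groupBy(_._2).map { case (l, v) => l -> v.size.toDouble }
    (for {
      (f, r) <- rowTotals.toSeq
      (l, c) <- colTotals.toSeq
    } yield {
      val expected = r * c / n
      val obs = observed.getOrElse((f, l), 0.0)
      (obs - expected) * (obs - expected) / expected
    }).sum
  }

  /** (rows - 1) * (columns - 1) of the contingency matrix. */
  def degreesOfFreedom(pairs: Seq[(Double, Double)]): Int =
    (pairs.map(_._1).distinct.size - 1) * (pairs.map(_._2).distinct.size - 1)

  /** p-value for exactly 2 degrees of freedom: P(X > x) = exp(-x / 2). */
  def pValueTwoDof(stat: Double): Double = math.exp(-stat / 2)

  // (label, feature1, feature2) rows copied from the quoted test data.
  val rows: Seq[(Double, Double, Double)] = Seq(
    (0.0, 0.5, 10.0), (0.0, 1.5, 20.0), (1.0, 1.5, 30.0),
    (0.0, 3.5, 30.0), (0.0, 3.5, 40.0), (1.0, 3.5, 40.0))

  val feature1: Seq[(Double, Double)] = rows.map { case (l, f1, _) => (f1, l) }
  val feature2: Seq[(Double, Double)] = rows.map { case (l, _, f2) => (f2, l) }

  def main(args: Array[String]): Unit = {
    println(f"feature1: statistic=${statistic(feature1)}%.4f df=${degreesOfFreedom(feature1)}")
    println(f"feature2: statistic=${statistic(feature2)}%.4f df=${degreesOfFreedom(feature2)}")
    println(f"feature1 p-value (2 dof): ${pValueTwoDof(statistic(feature1))}%.4f")
  }
}
```

Under these assumptions the sketch agrees with the test's assertions: statistics of 0.75 and 1.5, degrees of freedom of 2 and 3, and a p-value of about 0.6873 for the first feature.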



[GitHub] spark pull request #17110: [SPARK-19635][ML] DataFrame-based API for chi squ...

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17110#discussion_r103813679
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/stat/ChiSquareSuite.scala ---
    @@ -0,0 +1,94 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.stat
    +
    +import java.util.Random
    +
    +import org.apache.spark.{SparkException, SparkFunSuite}
    +import org.apache.spark.ml.feature.LabeledPoint
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.ml.util.TestingUtils._
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +
    +class ChiSquareSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  test("test DataFrame of labeled points") {
    +    // labels: 1.0 (2 / 6), 0.0 (4 / 6)
    +    // feature1: 0.5 (1 / 6), 1.5 (2 / 6), 3.5 (3 / 6)
    +    // feature2: 10.0 (1 / 6), 20.0 (1 / 6), 30.0 (2 / 6), 40.0 (2 / 6)
    +    val data = Seq(
    +      LabeledPoint(0.0, Vectors.dense(0.5, 10.0)),
    +      LabeledPoint(0.0, Vectors.dense(1.5, 20.0)),
    +      LabeledPoint(1.0, Vectors.dense(1.5, 30.0)),
    +      LabeledPoint(0.0, Vectors.dense(3.5, 30.0)),
    +      LabeledPoint(0.0, Vectors.dense(3.5, 40.0)),
    +      LabeledPoint(1.0, Vectors.dense(3.5, 40.0)))
    +    for (numParts <- List(2, 4, 6, 8)) {
    +      val df = spark.createDataFrame(sc.parallelize(data, numParts))
    +      val chi = ChiSquare.test(df, "features", "label")
    +      val (pValues: Vector, degreesOfFreedom: Array[Int], statistics: Vector) =
    +        chi.select("pValues", "degreesOfFreedom", "statistics")
    +          .as[(Vector, Array[Int], Vector)].head()
    +      assert(pValues ~== Vectors.dense(0.6873, 0.6823) relTol 1e-4)
    +      assert(degreesOfFreedom === Array(2, 3))
    +      assert(statistics ~== Vectors.dense(0.75, 1.5) relTol 1e-4)
    +    }
    +  }
    +
    +  test("large number of features (SPARK-3087)") {
    +    // Test that the right number of results is returned
    +    val numCols = 1001
    +    val sparseData = Array(
    +      LabeledPoint(0.0, Vectors.sparse(numCols, Seq((100, 2.0)))),
    +      LabeledPoint(0.1, Vectors.sparse(numCols, Seq((200, 1.0)))))
    +    val df = spark.createDataFrame(sparseData)
    +    val chi = ChiSquare.test(df, "features", "label")
    +    val (pValues: Vector, degreesOfFreedom: Array[Int], statistics: Vector) =
    +      chi.select("pValues", "degreesOfFreedom", "statistics")
    +        .as[(Vector, Array[Int], Vector)].head()
    +    assert(pValues.size === numCols)
    +    assert(degreesOfFreedom.length === numCols)
    +    assert(statistics.size === numCols)
    +    assert(pValues(1000) !== null)  // SPARK-3087
    +  }
    +
    +  test("fail on continuous features or labels") {
    +    // Detect continuous features or labels
    +    val random = new Random(11L)
    +    val continuousLabel =
    +      Seq.fill(100000)(LabeledPoint(random.nextDouble(), Vectors.dense(random.nextInt(2))))
    --- End diff --
    
    Can the special value 100000, which exceeds the max categorical limit of 10000, be refactored into a named constant?
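
A minimal sketch of what such a refactoring might look like, written in plain Scala so it runs without Spark. The names `MaxCategories` and `NumContinuousSamples` are hypothetical, and the limit of 10000 is taken from the comment above rather than from the patch itself.

```scala
import java.util.Random

object ContinuousDataSketch {
  // Assumed limit on distinct categorical values, as quoted in the review comment.
  val MaxCategories: Int = 10000

  // Hypothetical constant: a sample count chosen to exceed the limit,
  // replacing the bare literal 100000 used in the test.
  val NumContinuousSamples: Int = 10 * MaxCategories

  // Draws enough random doubles that the number of distinct values
  // is expected to exceed MaxCategories.
  def continuousValues(seed: Long): Seq[Double] = {
    val random = new Random(seed)
    Seq.fill(NumContinuousSamples)(random.nextDouble())
  }
}
```

With the sample count derived from the limit, the test would keep exceeding the limit if the limit ever changed.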



[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/17110
  
    LGTM!  nice addition :)



[GitHub] spark pull request #17110: [SPARK-19635][ML] DataFrame-based API for chi squ...

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17110#discussion_r103804058
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/ChiSquare.scala ---
    @@ -0,0 +1,81 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.stat
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.util.SchemaUtils
    +import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
    +import org.apache.spark.mllib.regression.{LabeledPoint => OldLabeledPoint}
    +import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions.col
    +
    +
    +/**
    + * :: Experimental ::
    + *
    + * Chi-square hypothesis testing for categorical data.
    + *
    + * See <a href="http://en.wikipedia.org/wiki/Chi-squared_test">Wikipedia</a> for more information
    + * on the Chi-squared test.
    + */
    +@Experimental
    +@Since("2.2.0")
    +object ChiSquare {
    +
    +  /** Used to construct output schema of tests */
    +  private case class ChiSquareResult(
    +      pValues: Vector,
    +      degreesOfFreedom: Array[Int],
    +      statistics: Vector)
    +
    +  /**
    +   * Conduct Pearson's independence test for every feature against the label across the input RDD.
    +   * For each feature, the (feature, label) pairs are converted into a contingency matrix for which
    +   * the Chi-squared statistic is computed. All label and feature values must be categorical.
    +   *
    +   * The null hypothesis is that the occurrence of the outcomes is statistically independent.
    +   *
    +   * @param dataset  DataFrame of categorical labels and categorical features.
    +   *                 Real-valued features will be treated as categorical for each distinct value.
    +   * @param featuresCol  Name of features column in dataset, of type `Vector` (`VectorUDT`)
    +   * @param labelCol  Name of label column in dataset, of any numerical type
    +   * @return DataFrame containing the test result for every feature against the label.
    +   *         This DataFrame will contain a single Row with the following fields:
    +   *          - `pValues: Vector`
    +   *          - `degreesOfFreedom: Array[Int]`
    +   *          - `statistics: Vector`
    +   *         Each of these fields has one value per feature.
    +   */
    +  @Since("2.2.0")
    +  def test(dataset: DataFrame, featuresCol: String, labelCol: String): DataFrame = {
    +    val spark = dataset.sparkSession
    +    import spark.implicits._
    +
    +    SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT)
    +    SchemaUtils.checkNumericType(dataset.schema, labelCol)
    --- End diff --
    
    Shouldn't the chi-square test work for binary type as well? Or do we not want to support that?
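
For reference, the per-feature computation described in the doc comment above (build a contingency matrix from the (feature, label) pairs, then compute Pearson's chi-squared statistic) can be sketched in plain Scala. This is an illustrative sketch, not the code in the patch: the object and method names are hypothetical, it handles a single feature, and it does not compute p-values. Because it is generic in the label type, it also accepts Boolean labels, which is only meant to illustrate the question, not to describe what `ChiSquare.test` accepts.

```scala
object ChiSquareSketch {
  /** Returns (statistic, degreesOfFreedom) for one feature against the label. */
  def chiSquared[F, L](pairs: Seq[(F, L)]): (Double, Int) = {
    val n = pairs.size.toDouble
    // Observed count for each (feature value, label value) cell of the contingency matrix.
    val observed: Map[(F, L), Int] =
      pairs.groupBy(identity).map { case (k, v) => k -> v.size }
    // Marginal totals per distinct feature value and per distinct label value.
    val featureTotals: Map[F, Int] =
      pairs.groupBy(_._1).map { case (k, v) => k -> v.size }
    val labelTotals: Map[L, Int] =
      pairs.groupBy(_._2).map { case (k, v) => k -> v.size }
    // Sum of (observed - expected)^2 / expected over every cell, including empty ones.
    val statistic = (for {
      (f, fCount) <- featureTotals.toSeq
      (l, lCount) <- labelTotals.toSeq
    } yield {
      val expected = fCount * lCount / n
      val diff = observed.getOrElse((f, l), 0) - expected
      diff * diff / expected
    }).sum
    (statistic, (featureTotals.size - 1) * (labelTotals.size - 1))
  }
}
```

Applied to the six labeled points from the test suite, `ChiSquareSketch.chiSquared(Seq(0.5, 1.5, 1.5, 3.5, 3.5, 3.5).zip(Seq(0.0, 0.0, 1.0, 0.0, 0.0, 1.0)))` gives a statistic of approximately 0.75 with 2 degrees of freedom, matching the values asserted for the first feature.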



[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17110
  
    **[Test build #73644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73644/testReport)** for PR 17110 at commit [`a9a8225`](https://github.com/apache/spark/commit/a9a8225162da4064714393d24ea601f1cd42753a).



[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17110
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/17110
  
    cool, I'll hold off on reviewing this for now then



[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17110
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74227/
    Test PASSed.

