Posted to reviews@spark.apache.org by dorx <gi...@git.apache.org> on 2014/08/02 05:50:40 UTC

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

GitHub user dorx opened a pull request:

    https://github.com/apache/spark/pull/1733

    [SPARK-2515][mllib] Chi Squared test

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dorx/spark chisquare

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1733.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1733
    
----
commit ff17423bd714592d38b69df426382838216cd133
Author: Doris Xin <do...@gmail.com>
Date:   2014-07-25T19:31:35Z

    WIP

commit 6598379979e1ed69de6956ebf56ad0b7b47029bf
Author: Doris Xin <do...@gmail.com>
Date:   2014-07-25T22:29:08Z

    API and code structure.

commit 706d436aea3db8b8cf15db0bcccb25e19c121a78
Author: Doris Xin <do...@gmail.com>
Date:   2014-07-25T22:38:07Z

    Added API for RDD[Vector]

commit 3d615828a913b341c9fc7afe6e371f3950d591ab
Author: Doris Xin <do...@gmail.com>
Date:   2014-07-25T22:54:23Z

    input names

commit e6b83f35375701f71f699697a83236e7e0c76d6c
Author: Doris Xin <do...@gmail.com>
Date:   2014-08-01T20:33:04Z

    reviewer comments

commit 4e4e36199aa81d9d1628322c499e40556fbdc6ef
Author: Doris Xin <do...@gmail.com>
Date:   2014-08-02T02:15:57Z

    WIP

commit 50703a57712ced5afbed4e2be73a268e7009c0c9
Author: Doris Xin <do...@gmail.com>
Date:   2014-08-02T02:20:03Z

    merge master

commit bc7eb2eeba4e2ccf10b891e4ce59db55823cea3b
Author: Doris Xin <do...@gmail.com>
Date:   2014-08-02T03:48:05Z

    unit passed; still need docs and some refactoring

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51853748
  
    QA tests have started for PR 1733. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18340/consoleFull


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16014291
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
    @@ -0,0 +1,88 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import org.apache.spark.annotation.Experimental
    +
    +/**
    + * :: Experimental ::
    + * Trait for hypothesis test results.
    + * @tparam DF Return type of `degreesOfFreedom`
    + */
    +@Experimental
    +trait TestResult[DF] {
    +
    +  /**
    +   *
    +   */
    +  def pValue: Double
    +
    +  /**
    +   *
    +   * @return
    +   */
    +  def degreesOfFreedom: DF
    +
    +  /**
    +   *
    +   * @return
    +   */
    +  def statistic: Double
    +
    +  /**
    +   * String explaining the hypothesis test result.
    +   * Specific classes implementing this trait should override this method to output test-specific
    +   * information.
    +   */
    +  override def toString: String = {
    +
    +    // String explaining what the p-value indicates.
    +    val pValueExplain = if (pValue <= 0.01) {
    +      "Very strong presumption against null hypothesis."
    +    } else if (0.01 < pValue && pValue <= 0.05) {
    +      "Strong presumption against null hypothesis."
    +    } else if (0.05 < pValue && pValue <= 0.1) {
    +      "Low presumption against null hypothesis."
    +    } else {
    +      "No presumption against null hypothesis."
    +    }
    +
    +    s"degrees of freedom = ${degreesOfFreedom.toString} \n" +
    +    s"statistic = $statistic \n" +
    +    s"pValue = $pValue \n" + pValueExplain
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Object containing the test results for the chi squared hypothesis test.
    + */
    +@Experimental
    +case class ChiSquaredTestResult(override val pValue: Double,
    --- End diff --
    
    Btw, shall we rename it to `ChiSqTestResult`? So `chiSqTest() returns ChiSqTestResult`.


[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by dorx <gi...@git.apache.org>.

Github user dorx commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16015802
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
    @@ -0,0 +1,88 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import org.apache.spark.annotation.Experimental
    +
    +/**
    + * :: Experimental ::
    + * Trait for hypothesis test results.
    + * @tparam DF Return type of `degreesOfFreedom`
    + */
    +@Experimental
    +trait TestResult[DF] {
    +
    +  /**
    +   *
    +   */
    +  def pValue: Double
    +
    +  /**
    +   *
    +   * @return
    +   */
    +  def degreesOfFreedom: DF
    +
    +  /**
    +   *
    +   * @return
    +   */
    +  def statistic: Double
    +
    +  /**
    +   * String explaining the hypothesis test result.
    +   * Specific classes implementing this trait should override this method to output test-specific
    +   * information.
    +   */
    +  override def toString: String = {
    +
    +    // String explaining what the p-value indicates.
    +    val pValueExplain = if (pValue <= 0.01) {
    +      "Very strong presumption against null hypothesis."
    +    } else if (0.01 < pValue && pValue <= 0.05) {
    +      "Strong presumption against null hypothesis."
    +    } else if (0.05 < pValue && pValue <= 0.1) {
    +      "Low presumption against null hypothesis."
    +    } else {
    +      "No presumption against null hypothesis."
    +    }
    +
    +    s"degrees of freedom = ${degreesOfFreedom.toString} \n" +
    +    s"statistic = $statistic \n" +
    +    s"pValue = $pValue \n" + pValueExplain
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Object containing the test results for the chi squared hypothesis test.
    + */
    +@Experimental
    +case class ChiSquaredTestResult(override val pValue: Double,
    --- End diff --
    
    Whether correction is used or not can actually be reflected in the method name (`pearson` vs. `yates`). I doubt there are many use cases for parsing the result back from JSON, so let's not worry about it for now. The way I see case classes is that they're like data structs that encapsulate immutable fields (the list of fields can be modified in later releases, given that this is all experimental), but if there are compiler optimization complications, I can change it to a regular class.


[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15981559
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/stat/HypothesisTestSuite.scala ---
    @@ -0,0 +1,128 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat
    +
    +import org.scalatest.FunSuite
    +
    +import org.apache.spark.mllib.linalg.{DenseVector, Matrices, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.test.ChiSquaredTest
    +import org.apache.spark.mllib.util.LocalSparkContext
    +import org.apache.spark.mllib.util.TestingUtils._
    +
    +class HypothesisTestSuite extends FunSuite with LocalSparkContext {
    +
    +  test("chi squared pearson goodness of fit") {
    +
    +    val observed = new DenseVector(Array[Double](4, 6, 5))
    +    val pearson = Statistics.chiSqTest(observed)
    +
    +    // Results validated against the R command `chisq.test(c(4, 6, 5), p=c(1/3, 1/3, 1/3))`
    +    assert(pearson.statistic === 0.4)
    +    assert(pearson.degreesOfFreedom === 2)
    +    assert(pearson.pValue ~= 0.8187 relTol 1e-4)
    +    assert(pearson.method === ChiSquaredTest.PEARSON.name)
    +    assert(pearson.nullHypothesis === ChiSquaredTest.NullHypothesis.goodnessOfFit.toString)
    +
    +    // different expected and observed sum
    +    val observed1 = new DenseVector(Array[Double](21, 38, 43, 80))
    +    val expected1 = new DenseVector(Array[Double](3, 5, 7, 20))
    +    val pearson1 = Statistics.chiSqTest(observed1, expected1)
    +
    +    // Results validated against the R command
    +    // `chisq.test(c(21, 38, 43, 80), p=c(3/35, 1/7, 1/5, 4/7))`
    +    assert(pearson1.statistic ~= 14.1429 relTol 1e-4)
    +    assert(pearson1.degreesOfFreedom === 3)
    +    assert(pearson1.pValue ~= 0.002717 relTol 1e-4)
    +    assert(pearson1.method === ChiSquaredTest.PEARSON.name)
    +    assert(pearson1.nullHypothesis === ChiSquaredTest.NullHypothesis.goodnessOfFit.toString)
    +
    +    // SparseVector representation to make sure memory doesn't blow up
    --- End diff --
    
    Remove commented blocks.


[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by dorx <gi...@git.apache.org>.

Github user dorx commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16009653
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
    @@ -0,0 +1,88 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import org.apache.spark.annotation.Experimental
    +
    +/**
    + * :: Experimental ::
    + * Trait for hypothesis test results.
    + * @tparam DF Return type of `degreesOfFreedom`
    + */
    +@Experimental
    +trait TestResult[DF] {
    +
    +  /**
    +   *
    +   */
    +  def pValue: Double
    +
    +  /**
    +   *
    +   * @return
    +   */
    +  def degreesOfFreedom: DF
    +
    +  /**
    +   *
    +   * @return
    +   */
    +  def statistic: Double
    +
    +  /**
    +   * String explaining the hypothesis test result.
    +   * Specific classes implementing this trait should override this method to output test-specific
    +   * information.
    +   */
    +  override def toString: String = {
    +
    +    // String explaining what the p-value indicates.
    +    val pValueExplain = if (pValue <= 0.01) {
    +      "Very strong presumption against null hypothesis."
    +    } else if (0.01 < pValue && pValue <= 0.05) {
    +      "Strong presumption against null hypothesis."
    +    } else if (0.05 < pValue && pValue <= 0.1) {
    +      "Low presumption against null hypothesis."
    +    } else {
    +      "No presumption against null hypothesis."
    +    }
    +
    +    s"degrees of freedom = ${degreesOfFreedom.toString} \n" +
    +    s"statistic = $statistic \n" +
    +    s"pValue = $pValue \n" + pValueExplain
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Object containing the test results for the chi squared hypothesis test.
    + */
    +@Experimental
    +case class ChiSquaredTestResult(override val pValue: Double,
    --- End diff --
    
    A case class is a logical choice here since it's essentially an immutable object holding a bunch of invariant fields and doesn't do any stateful computation inside the class. Is there a development plan for extending this class in the future?
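
A minimal sketch of the "immutable data struct" argument, in plain Scala. The names `ResultSketch`, `ChiSqResultSketch`, and `CaseClassDemo` are hypothetical and are not part of this PR; the field values are borrowed from the goodness-of-fit test in this thread purely for illustration.

```scala
// Hypothetical names for illustration only; this is not the PR's API.
trait ResultSketch {
  def pValue: Double
  def statistic: Double
}

// Case class parameters are immutable vals, and they implement the trait's
// abstract members directly.
case class ChiSqResultSketch(pValue: Double, degreesOfFreedom: Int, statistic: Double)
  extends ResultSketch

object CaseClassDemo {
  val result = ChiSqResultSketch(0.8187, 2, 0.4)
  val sameFields = ChiSqResultSketch(0.8187, 2, 0.4)

  // The compiler generates structural equality for case classes.
  val structurallyEqual: Boolean = result == sameFields

  // `copy` returns a new instance and leaves the original untouched.
  val adjusted: ChiSqResultSketch = result.copy(pValue = 0.05)
}
```

One trade-off worth keeping in mind: the generated `copy`, `apply`, and `unapply` signatures depend on the field list, so adding a field later changes them for callers.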


[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51849419
  
    QA results for PR 1733:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18333/consoleFull


[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by dorx <gi...@git.apache.org>.

Github user dorx commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51570348
  
    @mengxr @jkbradley @falaki 
    In case you guys haven't noticed, the latest version implements the discussed APIs.


[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16024037
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, returning an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels = Array[Double]()
    +    // At most 100 columns at a time
    +    val batchSize = 100
    +    var batch = 0
    +    while (batch * batchSize < numCols) {
    +      // The following block of code can be cleaned up and made public as
    +      // chiSquared(data: RDD[(V1, V2)])
    +      val startCol = batch * batchSize
    +      val endCol = startCol + math.min(batchSize, numCols - startCol)
    +      val pairCounts = data.flatMap { p =>
    +        // assume dense vectors
    +        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    +          (col, feature, p.label)
    +        }
    +      }.countByValue()
    +
    +      if (labels.size == 0) {
    +        // Do this only once for the first column since labels are invariant across features.
    +        labels = pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct
    +      }
    +      val numLabels = labels.size
    +      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
    +        val features = keys.map(_._2).toArray.distinct
    +        val numRows = features.size
    +        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
    +        keys.foreach { case (_, feature, label) =>
    +          val i = features.indexOf(feature)
    +          val j = labels.indexOf(label)
    +          contingency(i, j) += pairCounts((col, feature, label))
    +        }
    +        results(col) = chiSquaredMatrix(Matrices.fromBreeze(contingency), methodName)
    +      }
    +      batch += 1
    +    }
    +    results
    +  }
    +
    +  /*
    +   * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
    +   * Uniform distribution is assumed when `expected` is not passed in.
    +   */
    +  def chiSquared(observed: Vector,
    +      expected: Vector = Vectors.dense(Array[Double]()),
    +      methodName: String = PEARSON.name): ChiSqTestResult = {
    +
    +    // Validate input arguments
    +    val method = methodFromString(methodName)
    +    if (expected.size != 0 && observed.size != expected.size) {
    +      throw new IllegalArgumentException("observed and expected must be of the same size.")
    +    }
    +    val size = observed.size
    +    // Avoid calling toArray on input vectors to avoid memory blow up
    +    // (esp if size = Int.MaxValue for a SparseVector).
    +    // Check positivity and collect sums
    +    var obsSum = 0.0
    +    var expSum = if (expected.size == 0.0) 1.0 else 0.0
    +    var i = 0
    +    while (i < size) {
    +      val obs = observed(i)
    +      if (obs < 0.0) {
    +        throw new IllegalArgumentException("Values in observed must be nonnegative.")
    +      }
    +      obsSum += obs
    +      if (expected.size > 0) {
    +        val exp = expected(i)
    +        if (exp <= 0.0) {
    +          throw new IllegalArgumentException("Values in expected must be positive.")
    +        }
    +        expSum += exp
    +      }
    +      i += 1
    +    }
    +
    +    // Determine the scaling factor for expected
    +    val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else  obsSum / expSum
    --- End diff --
    
    nit: `else__` -> `else_`
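
For readers following the scaling step, here is a minimal standalone sketch of Pearson's goodness-of-fit statistic over plain arrays. `PearsonSketch` is a hypothetical helper, not the PR's code. The quoted diff stops at the `scale` line, so multiplying each expected value by `scale` is an assumption about how the factor is applied; it does reproduce the statistics asserted in `HypothesisTestSuite`.

```scala
// Hypothetical helper for illustration; not the PR's implementation.
object PearsonSketch {
  def statistic(observed: Array[Double], expected: Array[Double]): Double = {
    require(observed.length == expected.length,
      "observed and expected must be of the same size.")
    require(observed.forall(_ >= 0.0), "Values in observed must be nonnegative.")
    require(expected.forall(_ > 0.0), "Values in expected must be positive.")

    val obsSum = observed.sum
    val expSum = expected.sum
    // Rescale expected only when the two totals differ beyond a small tolerance.
    val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else obsSum / expSum

    // Sum of (observed - expected)^2 / expected, using the rescaled expected values.
    // Assumption: the scale factor multiplies each expected entry.
    observed.zip(expected).map { case (obs, exp) =>
      val scaled = exp * scale
      val dev = obs - scaled
      dev * dev / scaled
    }.sum
  }
}
```

With observed counts `(4, 6, 5)` against a uniform expectation this gives 0.4, and with `(21, 38, 43, 80)` against `(3, 5, 7, 20)` it gives roughly 14.1429, the two statistics checked in the test suite quoted earlier in this thread.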


[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51290954
  
    I think we should either allow the user to input the raw observations or use `Map[_, Long]` for input frequencies. I'm going to take a look at R's implementation ...
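
A minimal sketch of how the two input styles relate, assuming local collections rather than RDDs. `FrequencySketch.frequencies` is a hypothetical helper, not a proposed API: it turns raw observations into a `Map[T, Long]` of frequencies.

```scala
// Hypothetical helper for illustration; not a proposed API.
object FrequencySketch {
  // Count how many times each distinct value occurs in the raw observations.
  def frequencies[T](raw: Seq[T]): Map[T, Long] =
    raw.groupBy(identity).map { case (value, hits) => value -> hits.size.toLong }
}
```

For example, `FrequencySketch.frequencies(Seq("a", "b", "a"))` yields a map with `"a" -> 2L` and `"b" -> 1L`. On an RDD, `countByValue()`, which the feature-vs-label code in this PR already calls, returns the same kind of value-to-count map.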


[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-50953135
  
    QA results for PR 1733:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17744/consoleFull


[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15981444
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,211 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the Chi-squared test for the input RDDs using the specified method.
    --- End diff --
    
    `Chi` -> `chi`


[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15981458
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
    @@ -0,0 +1,88 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import org.apache.spark.annotation.Experimental
    +
    +/**
    + * :: Experimental ::
    + * Trait for hypothesis test results.
    + * @tparam DF Return type of `degreesOfFreedom`
    + */
    +@Experimental
    +trait TestResult[DF] {
    +
    +  /**
    +   *
    +   */
    +  def pValue: Double
    +
    +  /**
    +   *
    +   * @return
    +   */
    +  def degreesOfFreedom: DF
    +
    +  /**
    +   *
    --- End diff --
    
    doc



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by dorx <gi...@git.apache.org>.

Github user dorx commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15854474
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
    @@ -89,4 +90,76 @@ object Statistics {
        */
       @Experimental
       def corr(x: RDD[Double], y: RDD[Double], method: String): Double = Correlations.corr(x, y, method)
    +
    +  /**
    +   * :: Experimental ::
    +   * Conduct the Chi-squared goodness of fit test of the observed data against the
    +   * expected distribution.
    +   *
    +   * Note: the two input RDDs need to have the same number of partitions and the same number of
    +   * elements in each partition.
    +   *
    +   * @param observed RDD[Double] containing the observed counts.
    +   * @param expected RDD[Double] containing the expected counts. If the observed total differs from
    +   *                 the expected total, this RDD is rescaled to sum up to the observed total.
    +   * @param method String specifying the method to use for the Chi-squared test.
    +   *               Supported: `pearson` (default)
    +   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
    +   *         the method used, and the null hypothesis.
    +   */
    +  @Experimental
    +  def chiSquared(observed: RDD[Double],
    --- End diff --
    
    `chiSqTest` sounds good.



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51866151
  
    LGTM. Merged into both master and branch-1.1. Thanks!



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15857945
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
    @@ -0,0 +1,75 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import org.apache.spark.annotation.Experimental
    +
    +/**
    + * :: Experimental ::
    + * Trait for hypothesis test results.
    + */
    +@Experimental
    +trait TestResult {
    +
    +  def pValue: Double
    +
    +  def degreesOfFreedom: Array[Long]
    --- End diff --
    
    `df` should be an array of doubles, or we can make it a generic type. In the t-test and F-test, `df` values are not always integers.
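
A minimal sketch of the generic-type option mentioned above. The trait name and members mirror the diff, but `ChiSqResult` and `WelchTResult` are hypothetical names used only for illustration, and this is not necessarily the code that was merged. Welch's t-test is used as an example of a test whose degrees of freedom are generally fractional.

```scala
// Hypothetical sketch: the trait mirrors the diff, the case classes are illustrative only.
trait TestResult[DF] {
  def pValue: Double
  def degreesOfFreedom: DF
  def statistic: Double
}

// A chi-squared test reports a single integer degrees-of-freedom value.
case class ChiSqResult(pValue: Double, degreesOfFreedom: Int, statistic: Double)
  extends TestResult[Int]

// Welch's t-test generally yields a fractional degrees-of-freedom value.
case class WelchTResult(pValue: Double, degreesOfFreedom: Double, statistic: Double)
  extends TestResult[Double]
```

With a type parameter, each result class picks the representation that fits its test, so callers of a chi-squared result get an `Int` without any conversion.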



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15981435
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
    @@ -89,4 +91,64 @@ object Statistics {
        */
       @Experimental
       def corr(x: RDD[Double], y: RDD[Double], method: String): Double = Correlations.corr(x, y, method)
    +
    +  /**
    +   * :: Experimental ::
    +   * Conduct Pearson's chi-squared goodness of fit test of the observed data against the
    +   * expected distribution.
    +   *
    +   * Note: the two input Vectors need to have the same size.
    +   *       `observed` cannot contain negative values.
    +   *       `expected` cannot contain nonpositive values.
    +   *
    +   * @param observed Vector containing the observed categorical counts/relative frequencies.
    +   * @param expected Vector containing the expected categorical counts/relative frequencies.
    +   *                 `expected` is rescaled if the `expected` sum differs from the `observed` sum.
    +   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
    +   *         the method used, and the null hypothesis.
    +   */
    +  @Experimental
    +  def chiSqTest(observed: Vector,
    --- End diff --
    
    the following style may be better:
    
    ~~~
    def chiSqTest(observed: Vector, expected: Vector): ChiSquaredTestResult =
      ChiSquaredTest.chiSquared(observed, expected)
    ~~~
    
    ~~~
    def chiSqTest(observed: Vector, expected: Vector): ChiSquaredTestResult = {
      ChiSquaredTest.chiSquared(observed, expected)
    }
    ~~~



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15981456
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
    @@ -0,0 +1,88 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import org.apache.spark.annotation.Experimental
    +
    +/**
    + * :: Experimental ::
    + * Trait for hypothesis test results.
    + * @tparam DF Return type of `degreesOfFreedom`
    + */
    +@Experimental
    +trait TestResult[DF] {
    +
    +  /**
    +   *
    +   */
    +  def pValue: Double
    +
    +  /**
    +   *
    --- End diff --
    
    doc



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by dorx <gi...@git.apache.org>.

Github user dorx commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16075494
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, return an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels = Array[Double]()
    +    // At most 100 columns at a time
    +    val batchSize = 100
    +    var batch = 0
    +    while (batch * batchSize < numCols) {
    +      // The following block of code can be cleaned up and made public as
    +      // chiSquared(data: RDD[(V1, V2)])
    +      val startCol = batch * batchSize
    +      val endCol = startCol + math.min(batchSize, numCols - startCol)
    +      val pairCounts = data.flatMap { p =>
    +        // assume dense vectors
    +        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    +          (col, feature, p.label)
    +        }
    +      }.countByValue()
    +
    +      if (labels.size == 0) {
    +        // Do this only once for the first column since labels are invariant across features.
    +        labels = pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct
    +      }
    +      val numLabels = labels.size
    +      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
    +        val features = keys.map(_._2).toArray.distinct
    +        val numRows = features.size
    +        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
    +        keys.foreach { case (_, feature, label) =>
    +          val i = features.indexOf(feature)
    +          val j = labels.indexOf(label)
    +          contingency(i, j) += pairCounts((col, feature, label))
    +        }
    +        results(col) = chiSquaredMatrix(Matrices.fromBreeze(contingency), methodName)
    +      }
    +      batch += 1
    +    }
    +    results
    +  }
    +
    +  /*
    +   * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
    +   * Uniform distribution is assumed when `expected` is not passed in.
    +   */
    +  def chiSquared(observed: Vector,
    +      expected: Vector = Vectors.dense(Array[Double]()),
    +      methodName: String = PEARSON.name): ChiSqTestResult = {
    +
    +    // Validate input arguments
    +    val method = methodFromString(methodName)
    +    if (expected.size != 0 && observed.size != expected.size) {
    +      throw new IllegalArgumentException("observed and expected must be of the same size.")
    +    }
    +    val size = observed.size
    +    // Avoid calling toArray on input vectors to avoid memory blow up
    +    // (esp if size = Int.MaxValue for a SparseVector).
    +    // Check positivity and collect sums
    +    var obsSum = 0.0
    +    var expSum = if (expected.size == 0.0) 1.0 else 0.0
    +    var i = 0
    +    while (i < size) {
    +      val obs = observed(i)
    +      if (obs < 0.0) {
    +        throw new IllegalArgumentException("Values in observed must be nonnegative.")
    +      }
    +      obsSum += obs
    +      if (expected.size > 0) {
    +        val exp = expected(i)
    +        if (exp <= 0.0) {
    +          throw new IllegalArgumentException("Values in expected must be positive.")
    +        }
    +        expSum += exp
    +      }
    +      i += 1
    +    }
    +
    +    // Determine the scaling factor for expected
    +    val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else  obsSum / expSum
    +    val getExpected: (Int) => Double = if (expected.size == 0) {
    +      // Assume uniform distribution
    +      if (scale == 1.0) _ => 1.0 / size else _ => scale / size
    +    } else {
    +      if (scale == 1.0) (i: Int) => expected(i) else (i: Int) => scale * expected(i)
    +    }
    +
    +    // compute chi-squared statistic
    +    var statistic = 0.0
    +    var j = 0
    +    while (j < observed.size) {
    +      val obs = observed(j)
    +      if (obs != 0.0) {
    +        statistic += method.chiSqFunc(obs, getExpected(j))
    +      }
    +      j += 1
    +    }
    +    val df = size - 1
    +    val pValue = chiSquareComplemented(df, statistic)
    +    new ChiSqTestResult(pValue, df, statistic, PEARSON.name, NullHypothesis.goodnessOfFit.toString)
    +  }
    +
    +  /*
    +   * Pearson's independence test on the input contingency matrix.
    +   * TODO: optimize for SparseMatrix when it becomes supported.
    +   */
    +  def chiSquaredMatrix(counts: Matrix, methodName:String = PEARSON.name): ChiSqTestResult = {
    +    val method = methodFromString(methodName)
    +    val numRows = counts.numRows
    +    val numCols = counts.numCols
    +
    +    // get row and column sums
    +    val colSums = new Array[Double](numCols)
    +    val rowSums = new Array[Double](numRows)
    +    val colMajorArr = counts.toArray
    +    var i = 0
    +    while (i < colMajorArr.size) {
    +      val elem = colMajorArr(i)
    +      if (elem < 0.0) {
    +        throw new IllegalArgumentException("Contingency table cannot contain negative entries.")
    +      }
    +      colSums(i / numRows) += elem
    +      rowSums(i % numRows) += elem
    +      i += 1
    +    }
    +    if (!colSums.forall(_ > 0.0) || !rowSums.forall(_ > 0.0)) {
    +      throw new IllegalArgumentException("Chi square statistic cannot be computed for input matrix "
    --- End diff --
    
    Since we return `statistic = Double.NaN` when `expected = 0.0` in the GOF test, do we also want to do the same here instead of throwing an exception?
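
To make the question concrete, here is a rough sketch of Pearson's independence statistic computed from a column-major contingency table without the up-front check on row and column sums. The function name `independenceStatistic` and its signature are assumptions for illustration; the actual statistic loop is not shown in the diff excerpt above. With no check, an all-zero row or column makes the expected count zero, and IEEE 754 division turns the statistic into `NaN`.

```scala
// Hypothetical helper, not the code from this pull request.
// `colMajor` holds the contingency table in column-major order.
def independenceStatistic(colMajor: Array[Double], numRows: Int, numCols: Int): Double = {
  require(colMajor.length == numRows * numCols, "array size must equal numRows * numCols")
  val rowSums = new Array[Double](numRows)
  val colSums = new Array[Double](numCols)
  var i = 0
  while (i < colMajor.length) {
    rowSums(i % numRows) += colMajor(i)
    colSums(i / numRows) += colMajor(i)
    i += 1
  }
  val total = colSums.sum
  var statistic = 0.0
  var j = 0
  while (j < colMajor.length) {
    // Expected count under independence: rowSum * colSum / total.
    val expected = rowSums(j % numRows) * colSums(j / numRows) / total
    val dev = colMajor(j) - expected
    // No guard here: if expected is 0.0 this term is 0.0 / 0.0 = NaN.
    statistic += dev * dev / expected
    j += 1
  }
  statistic
}

// A table whose second column is all zeros.
val degenerate = independenceStatistic(Array(1.0, 2.0, 0.0, 0.0), 2, 2)
```

Whether a silent `NaN` or an exception is friendlier to callers is the design question raised in the comment; the sketch only shows what the unguarded arithmetic produces.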



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16024033
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, return an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels = Array[Double]()
    +    // At most 100 columns at a time
    +    val batchSize = 100
    +    var batch = 0
    +    while (batch * batchSize < numCols) {
    +      // The following block of code can be cleaned up and made public as
    +      // chiSquared(data: RDD[(V1, V2)])
    +      val startCol = batch * batchSize
    +      val endCol = startCol + math.min(batchSize, numCols - startCol)
    +      val pairCounts = data.flatMap { p =>
    +        // assume dense vectors
    +        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    +          (col, feature, p.label)
    +        }
    +      }.countByValue()
    +
    +      if (labels.size == 0) {
    +        // Do this only once for the first column since labels are invariant across features.
    +        labels = pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct
    +      }
    +      val numLabels = labels.size
    +      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
    +        val features = keys.map(_._2).toArray.distinct
    +        val numRows = features.size
    +        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
    +        keys.foreach { case (_, feature, label) =>
    +          val i = features.indexOf(feature)
    +          val j = labels.indexOf(label)
    +          contingency(i, j) += pairCounts((col, feature, label))
    +        }
    +        results(col) = chiSquaredMatrix(Matrices.fromBreeze(contingency), methodName)
    +      }
    +      batch += 1
    +    }
    +    results
    +  }
    +
    +  /*
    +   * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
    +   * Uniform distribution is assumed when `expected` is not passed in.
    +   */
    +  def chiSquared(observed: Vector,
    +      expected: Vector = Vectors.dense(Array[Double]()),
    +      methodName: String = PEARSON.name): ChiSqTestResult = {
    +
    +    // Validate input arguments
    +    val method = methodFromString(methodName)
    +    if (expected.size != 0 && observed.size != expected.size) {
    +      throw new IllegalArgumentException("observed and expected must be of the same size.")
    +    }
    +    val size = observed.size
    +    // Avoid calling toArray on input vectors to avoid memory blow up
    +    // (esp if size = Int.MaxValue for a SparseVector).
    +    // Check positivity and collect sums
    +    var obsSum = 0.0
    +    var expSum = if (expected.size == 0.0) 1.0 else 0.0
    +    var i = 0
    +    while (i < size) {
    +      val obs = observed(i)
    +      if (obs < 0.0) {
    +        throw new IllegalArgumentException("Values in observed must be nonnegative.")
    +      }
    +      obsSum += obs
    +      if (expected.size > 0) {
    +        val exp = expected(i)
    +        if (exp <= 0.0) {
    --- End diff --
    
    Shall we return a test result rejecting the null instead of throwing an exception? If the expected probability is zero but the observed count is not, that is a clear signal to reject the null. This is what R returns:
    
    ~~~
    > chisq.test(c(5, 0, 3), p = c(0, 0.6, 0.4))
    
    	Chi-squared test for given probabilities
    
    data:  c(5, 0, 3)
    X-squared = Inf, df = 2, p-value < 2.2e-16
    ~~~
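
The R output above can be reproduced for the statistic alone with a small sketch that simply lets floating-point division handle the zero expected value. The helper name `pearsonStatistic` is hypothetical, it works on plain arrays rather than MLlib `Vector`s, and it leaves out the p-value; it is an illustration of the suggested behavior, not the implementation in this pull request.

```scala
// Illustrative helper (hypothetical name), not the code from this pull request.
def pearsonStatistic(observed: Array[Double], expectedProbs: Array[Double]): Double = {
  require(observed.length == expectedProbs.length, "inputs must have the same size")
  val total = observed.sum
  val probSum = expectedProbs.sum
  observed.zip(expectedProbs).map { case (obs, p) =>
    // Rescale the expected value so that expected sums to the observed total.
    val exp = total * p / probSum
    val dev = obs - exp
    // IEEE 754: a positive value divided by 0.0 is Infinity; 0.0 / 0.0 is NaN.
    dev * dev / exp
  }.sum
}

// Same inputs as the R call: chisq.test(c(5, 0, 3), p = c(0, 0.6, 0.4))
val stat = pearsonStatistic(Array(5.0, 0.0, 3.0), Array(0.0, 0.6, 0.4))
```

Here `stat` is positive infinity, matching `X-squared = Inf`. If both the observed count and the expected value for a category were zero, the same arithmetic would give `NaN` instead, so a real implementation would need to decide how to report that case.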



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15854488
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
    @@ -0,0 +1,75 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import org.apache.spark.annotation.Experimental
    +
    +/**
    + * :: Experimental ::
    + * Trait for hypothesis test results.
    + */
    +@Experimental
    +trait TestResult {
    +
    +  def pValue: Double
    +
    +  def degreesOfFreedom: Array[Long]
    +
    +  def statistic: Double
    --- End diff --
    
    ditto: doc



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15981454
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
    @@ -0,0 +1,88 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import org.apache.spark.annotation.Experimental
    +
    +/**
    + * :: Experimental ::
    + * Trait for hypothesis test results.
    + * @tparam DF Return type of `degreesOfFreedom`
    + */
    +@Experimental
    +trait TestResult[DF] {
    +
    +  /**
    +   *
    --- End diff --
    
    Is the doc comment missing here?
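
One way the missing Scaladoc could read is sketched below. The wording is illustrative and is not taken from the pull request; `DocumentedTestResult` and `SimpleResult` are hypothetical names used only to keep the snippet self-contained.

```scala
/** Sketch of a documented hypothesis-test result trait (hypothetical name). */
trait DocumentedTestResult[DF] {

  /**
   * Probability of obtaining a test statistic at least as extreme as the one
   * observed, assuming the null hypothesis is true.
   */
  def pValue: Double

  /** Degree(s) of freedom of the test; the concrete type depends on the test. */
  def degreesOfFreedom: DF

  /** Value of the test statistic. */
  def statistic: Double
}

/** Minimal implementation used only to exercise the trait (hypothetical). */
case class SimpleResult(pValue: Double, degreesOfFreedom: Int, statistic: Double)
  extends DocumentedTestResult[Int]
```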



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51541696
  
    QA tests have started for PR 1733. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18150/consoleFull



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15981590
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/stat/HypothesisTestSuite.scala ---
    @@ -0,0 +1,128 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat
    +
    +import org.scalatest.FunSuite
    +
    +import org.apache.spark.mllib.linalg.{DenseVector, Matrices, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.test.ChiSquaredTest
    +import org.apache.spark.mllib.util.LocalSparkContext
    +import org.apache.spark.mllib.util.TestingUtils._
    +
    +class HypothesisTestSuite extends FunSuite with LocalSparkContext {
    +
    +  test("chi squared pearson goodness of fit") {
    +
    +    val observed = new DenseVector(Array[Double](4, 6, 5))
    +    val pearson = Statistics.chiSqTest(observed)
    +
    +    // Results validated against the R command `chisq.test(c(4, 6, 5), p=c(1/3, 1/3, 1/3))`
    +    assert(pearson.statistic === 0.4)
    +    assert(pearson.degreesOfFreedom === 2)
    +    assert(pearson.pValue ~= 0.8187 relTol 1e-4)
    --- End diff --
    
    Use `~==` instead of `~=`. `~==` gives a more informative message when a check fails. Please update the other occurrences as well.
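
The difference between the two operators can be illustrated with a small stand-alone sketch. This is not the code in `org.apache.spark.mllib.util.TestingUtils`; the names `ToleranceSketch`, `Tolerance` and `AlmostEqualsDouble`, the exact tolerance rule and the exception type are assumptions made for illustration. The point is that a Boolean-returning operator leaves `assert` with nothing to report, while a throwing operator can put both values in the failure message.

```scala
object ToleranceSketch {
  /** Right-hand side of a comparison: an expected value and a relative tolerance. */
  case class Tolerance(expected: Double, eps: Double)

  implicit class AlmostEqualsDouble(val x: Double) {
    /** Builds the right-hand side, as in `0.8187 relTol 1e-4`. */
    def relTol(eps: Double): Tolerance = Tolerance(x, eps)

    /** Returns only true or false; a failing `assert` cannot say what the values were. */
    def ~=(t: Tolerance): Boolean =
      math.abs(x - t.expected) <= t.eps * math.min(math.abs(x), math.abs(t.expected))

    /** Throws with both values and the tolerance in the message when the check fails. */
    def ~==(t: Tolerance): Boolean = {
      val ok = this.~=(t)
      if (!ok) {
        throw new AssertionError(
          s"Expected $x and ${t.expected} to be within ${t.eps} using relative tolerance.")
      }
      true
    }
  }
}
```

With `import ToleranceSketch._` in scope, `assert(pValue ~== 0.8187 relTol 1e-4)` would fail with a message naming both numbers, whereas the `~=` form would only report that the assertion was false.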



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15854484
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
    @@ -0,0 +1,75 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import org.apache.spark.annotation.Experimental
    +
    +/**
    + * :: Experimental ::
    + * Trait for hypothesis test results.
    + */
    +@Experimental
    +trait TestResult {
    +
    +  def pValue: Double
    --- End diff --
    
    Please add documentation for `pValue`.



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16085233
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSqTest.scala ---
    @@ -0,0 +1,221 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, return an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels: Map[Double, Int] = null
    +    // At most 100 columns at a time
    +    val batchSize = 100
    +    var batch = 0
    +    while (batch * batchSize < numCols) {
    +      // The following block of code can be cleaned up and made public as
    +      // chiSquared(data: RDD[(V1, V2)])
    +      val startCol = batch * batchSize
    +      val endCol = startCol + math.min(batchSize, numCols - startCol)
    +      val pairCounts = data.flatMap { p =>
    +        // assume dense vectors
    +        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    +          (col, feature, p.label)
    +        }
    +      }.countByValue()
    +
    +      if (labels == null) {
    +        // Do this only once for the first column since labels are invariant across features.
    +        labels =
    +          pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct.zipWithIndex.toMap
    +      }
    +      val numLabels = labels.size
    +      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
    +        val features = keys.map(_._2).toArray.distinct.zipWithIndex.toMap
    +        val numRows = features.size
    +        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
    +        keys.foreach { case (_, feature, label) =>
    +          val i = features(feature)
    +          val j = labels(label)
    +          contingency(i, j) += pairCounts((col, feature, label))
    +        }
    +        results(col) = chiSquaredMatrix(Matrices.fromBreeze(contingency), methodName)
    +      }
    +      batch += 1
    +    }
    +    results
    +  }
    +
    +  /*
    +   * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
    +   * Uniform distribution is assumed when `expected` is not passed in.
    +   */
    +  def chiSquared(observed: Vector,
    +      expected: Vector = Vectors.dense(Array[Double]()),
    +      methodName: String = PEARSON.name): ChiSqTestResult = {
    +
    +    // Validate input arguments
    +    val method = methodFromString(methodName)
    +    if (expected.size != 0 && observed.size != expected.size) {
    +      throw new IllegalArgumentException("observed and expected must be of the same size.")
    +    }
    +    val size = observed.size
    +    if (size > 1000) {
    +      logWarning("Chi-squared approximation may not be accurate due to low expected frequencies "
    +        + s" as a result of a large number of categories: $size.")
    +    }
    +    val obsArr = observed.toArray
    +    val expArr = if (expected.size == 0) Array.tabulate(size)(_ => 1.0 / size) else expected.toArray
    +    if (!obsArr.forall(_ >= 0.0)) {
    +      throw new IllegalArgumentException("Negative entries disallowed in the observed vector.")
    +    }
    +    if (expected.size != 0 && ! expArr.forall(_ >= 0.0)) {
    +      throw new IllegalArgumentException("Negative entries disallowed in the expected vector.")
    +    }
    +
    +    // Determine the scaling factor for expected
    +    val obsSum = obsArr.sum
    +    val expSum = if (expected.size == 0.0) 1.0 else expArr.sum
    +    val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else obsSum / expSum
    +
    +    // compute chi-squared statistic
    +    val statistic = obsArr.zip(expArr).foldLeft(0.0) { case (stat, (obs, exp)) =>
    +      if (exp == 0.0) {
    +        if (obs == 0.0) {
    +          throw new IllegalArgumentException("Chi-squared statistic undefined for input vectors due"
    +            + " to 0.0 values in both observed and expected.")
    +        } else {
    +          return new ChiSqTestResult(Double.PositiveInfinity, size - 1, Double.PositiveInfinity,
    --- End diff --
    
    The p-value should be `0` here (strong evidence against the null hypothesis).
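
The reasoning behind this comment can be checked numerically. The p-value is the upper-tail probability of the chi-squared distribution at the computed statistic, so it shrinks toward 0 as the statistic grows. The sketch below is not the pull request's implementation: `PearsonSketch`, `pearsonStatistic` and `pValueDf2` are hypothetical names, and the p-value uses the closed form exp(-x/2), which holds only for exactly 2 degrees of freedom, to avoid a dependency on a statistics library. It does not handle the case where an observed and an expected entry are both 0.

```scala
object PearsonSketch {
  /**
   * Pearson statistic: sum of (observed - expected)^2 / expected,
   * with `expected` rescaled so that it sums to the observed total.
   */
  def pearsonStatistic(observed: Array[Double], expected: Array[Double]): Double = {
    require(observed.length == expected.length,
      "observed and expected must be of the same size.")
    val scale = observed.sum / expected.sum
    observed.zip(expected).map { case (obs, exp) =>
      val e = exp * scale
      val dev = obs - e
      dev * dev / e
    }.sum
  }

  /** Upper-tail probability of a chi-squared variable with exactly 2 degrees of freedom. */
  def pValueDf2(statistic: Double): Double = math.exp(-statistic / 2.0)
}
```

For `observed = (4, 6, 5)` against a uniform expectation, this gives a statistic of 0.4 and a p-value close to 0.8187, the values asserted in the test suite. A zero expected count paired with a nonzero observed count makes the statistic infinite, and `pValueDf2(Double.PositiveInfinity)` is 0.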



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51656796
  
    QA results for PR 1733:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18217/consoleFull



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51844208
  
    QA tests have started for PR 1733. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18333/consoleFull



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-50953122
  
    QA tests have started for PR 1733. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17744/consoleFull



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15854511
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
    @@ -0,0 +1,75 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import org.apache.spark.annotation.Experimental
    +
    +/**
    + * :: Experimental ::
    + * Trait for hypothesis test results.
    + */
    +@Experimental
    +trait TestResult {
    +
    +  def pValue: Double
    +
    +  def degreesOfFreedom: Array[Long]
    +
    +  def statistic: Double
    +
    +  /**
    +   * String explaining the hypothesis test result.
    +   * Specific classes implementing this trait should override this method to output test-specific
    +   * information.
    +   */
    +  override def toString: String = {
    +
    +    val pValueExplain = if (pValue <= 0.01) {
    +      "Very strong presumption against null hypothesis."
    +    } else if (0.01 < pValue && pValue <= 0.05) {
    +      "Strong presumption against null hypothesis."
    +    } else if (0.05 < pValue && pValue <= 0.1) {
    +      "Low presumption against null hypothesis."
    +    } else {
    +      "No presumption against null hypothesis."
    +    }
    +
    +    s"degrees of freedom = ${degreesOfFreedom.mkString} \n" +
    --- End diff --
    
    `mkString("[", ",", "]")`
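
The suggestion matters because `mkString` with no arguments concatenates the elements with nothing between them, so a multi-element `degreesOfFreedom` array becomes ambiguous. A short sketch, using a made-up array and the hypothetical object name `MkStringSketch`:

```scala
object MkStringSketch {
  // Hypothetical degrees-of-freedom values, chosen to show the ambiguity.
  val degreesOfFreedom: Array[Long] = Array(2L, 10L)

  // No separator: the elements run together and their boundaries are lost.
  val joined: String = degreesOfFreedom.mkString // "210"

  // Start, separator, end: each element stays distinguishable.
  val bracketed: String = degreesOfFreedom.mkString("[", ",", "]") // "[2,10]"
}
```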



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15854415
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
    @@ -89,4 +90,76 @@ object Statistics {
        */
       @Experimental
       def corr(x: RDD[Double], y: RDD[Double], method: String): Double = Correlations.corr(x, y, method)
    +
    +  /**
    +   * :: Experimental ::
    +   * Conduct the Chi-squared goodness of fit test of the observed data against the
    +   * expected distribution.
    +   *
    +   * Note: the two input RDDs need to have the same number of partitions and the same number of
    +   * elements in each partition.
    +   *
    +   * @param observed RDD[Double] containing the observed counts.
    +   * @param expected RDD[Double] containing the expected counts. If the observed total differs from
    +   *                 the expected total, this RDD is rescaled to sum up to the observed total.
    +   * @param method String specifying the method to use for the Chi-squared test.
    +   *               Supported: `pearson` (default)
    +   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
    +   *         the method used, and the null hypothesis.
    +   */
    +  @Experimental
    +  def chiSquared(observed: RDD[Double],
    --- End diff --
    
    Shall we call it `chiSqTest`, following R's `chisq.test`? We need `test` in the method name because chi-squared (χ²) is also the name of a distribution. I think `chiSqTest` is better than `chiSquaredTest` because the test is also called the `chi-square test`, without the `d`.



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16085437
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, return an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels = Array[Double]()
    +    // At most 100 columns at a time
    +    val batchSize = 100
    +    var batch = 0
    +    while (batch * batchSize < numCols) {
    +      // The following block of code can be cleaned up and made public as
    +      // chiSquared(data: RDD[(V1, V2)])
    +      val startCol = batch * batchSize
    +      val endCol = startCol + math.min(batchSize, numCols - startCol)
    +      val pairCounts = data.flatMap { p =>
    +        // assume dense vectors
    +        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    +          (col, feature, p.label)
    +        }
    +      }.countByValue()
    +
    +      if (labels.size == 0) {
    +        // Do this only once for the first column since labels are invariant across features.
    +        labels = pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct
    +      }
    +      val numLabels = labels.size
    +      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
    +        val features = keys.map(_._2).toArray.distinct
    +        val numRows = features.size
    +        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
    +        keys.foreach { case (_, feature, label) =>
    +          val i = features.indexOf(feature)
    +          val j = labels.indexOf(label)
    +          contingency(i, j) += pairCounts((col, feature, label))
    +        }
    +        results(col) = chiSquaredMatrix(Matrices.fromBreeze(contingency), methodName)
    +      }
    +      batch += 1
    +    }
    +    results
    +  }
    +
    +  /*
    +   * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
    +   * Uniform distribution is assumed when `expected` is not passed in.
    +   */
    +  def chiSquared(observed: Vector,
    +      expected: Vector = Vectors.dense(Array[Double]()),
    +      methodName: String = PEARSON.name): ChiSqTestResult = {
    +
    +    // Validate input arguments
    +    val method = methodFromString(methodName)
    +    if (expected.size != 0 && observed.size != expected.size) {
    +      throw new IllegalArgumentException("observed and expected must be of the same size.")
    +    }
    +    val size = observed.size
    +    // Avoid calling toArray on input vectors to avoid memory blow up
    +    // (esp if size = Int.MaxValue for a SparseVector).
    +    // Check positivity and collect sums
    +    var obsSum = 0.0
    +    var expSum = if (expected.size == 0.0) 1.0 else 0.0
    +    var i = 0
    +    while (i < size) {
    +      val obs = observed(i)
    +      if (obs < 0.0) {
    +        throw new IllegalArgumentException("Values in observed must be nonnegative.")
    +      }
    +      obsSum += obs
    +      if (expected.size > 0) {
    +        val exp = expected(i)
    +        if (exp <= 0.0) {
    +          throw new IllegalArgumentException("Values in expected must be positive.")
    +        }
    +        expSum += exp
    +      }
    +      i += 1
    +    }
    +
    +    // Determine the scaling factor for expected
    +    val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else  obsSum / expSum
    +    val getExpected: (Int) => Double = if (expected.size == 0) {
    +      // Assume uniform distribution
    +      if (scale == 1.0) _ => 1.0 / size else _ => scale / size
    +    } else {
    +      if (scale == 1.0) (i: Int) => expected(i) else (i: Int) => scale * expected(i)
    +    }
    +
    +    // compute chi-squared statistic
    +    var statistic = 0.0
    +    var j = 0
    +    while (j < observed.size) {
    +      val obs = observed(j)
    +      if (obs != 0.0) {
    +        statistic += method.chiSqFunc(obs, getExpected(j))
    +      }
    +      j += 1
    +    }
    +    val df = size - 1
    +    val pValue = chiSquareComplemented(df, statistic)
    +    new ChiSqTestResult(pValue, df, statistic, PEARSON.name, NullHypothesis.goodnessOfFit.toString)
    +  }
    +
    +  /*
    +   * Pearson's independence test on the input contingency matrix.
    +   * TODO: optimize for SparseMatrix when it becomes supported.
    +   */
    +  def chiSquaredMatrix(counts: Matrix, methodName:String = PEARSON.name): ChiSqTestResult = {
    +    val method = methodFromString(methodName)
    +    val numRows = counts.numRows
    +    val numCols = counts.numCols
    +
    +    // get row and column sums
    +    val colSums = new Array[Double](numCols)
    +    val rowSums = new Array[Double](numRows)
    +    val colMajorArr = counts.toArray
    +    var i = 0
    +    while (i < colMajorArr.size) {
    +      val elem = colMajorArr(i)
    +      if (elem < 0.0) {
    +        throw new IllegalArgumentException("Contingency table cannot contain negative entries.")
    +      }
    +      colSums(i / numRows) += elem
    +      rowSums(i % numRows) += elem
    +      i += 1
    +    }
    +    if (!colSums.forall(_ > 0.0) || !rowSums.forall(_ > 0.0)) {
    +      throw new IllegalArgumentException("Chi square statistic cannot be computed for input matrix "
    --- End diff --
    
    For this case, if there are empty rows or columns, both observed and expected are 0.0, so we should throw an exception.
    
    Btw, for the case when `expected = 0` and `observed > 0`, the result should be `statistic = Inf` and `pValue = 0.0`.
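
A minimal sketch of the two zero-expected cases described in this comment, reusing the per-cell Pearson term from the `PEARSON` method in the diff. The object name `PearsonTermSketch` is hypothetical and not part of the PR; the sketch relies only on IEEE 754 double arithmetic and does not compute a p-value.

```scala
object PearsonTermSketch {
  // Per-cell Pearson contribution: (observed - expected)^2 / expected.
  def pearsonTerm(observed: Double, expected: Double): Double = {
    val dev = observed - expected
    dev * dev / expected
  }

  def main(args: Array[String]): Unit = {
    // expected = 0 and observed > 0: a positive value divided by 0.0 is +Infinity.
    println(pearsonTerm(3.0, 0.0)) // Infinity
    // expected = 0 and observed = 0 (empty row or column): 0.0 / 0.0 is NaN,
    // which is why such input has to be rejected up front.
    println(pearsonTerm(0.0, 0.0)) // NaN
  }
}
```

Because the NaN case cannot be turned into a meaningful statistic, checking the row and column sums before computing any term, as the diff does, seems the simpler place to raise the exception.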



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15981447
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,211 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the Chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, returning an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSquaredTest extends Logging {
    --- End diff --
    
    minor: `ChiSquaredTest` -> `ChiSqTest` (to match the public method names)



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15854487
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
    @@ -0,0 +1,75 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import org.apache.spark.annotation.Experimental
    +
    +/**
    + * :: Experimental ::
    + * Trait for hypothesis test results.
    + */
    +@Experimental
    +trait TestResult {
    +
    +  def pValue: Double
    +
    +  def degreesOfFreedom: Array[Long]
    --- End diff --
    
    ditto: doc



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16024028
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, returning an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels = Array[Double]()
    +    // At most 100 columns at a time
    +    val batchSize = 100
    +    var batch = 0
    +    while (batch * batchSize < numCols) {
    +      // The following block of code can be cleaned up and made public as
    +      // chiSquared(data: RDD[(V1, V2)])
    +      val startCol = batch * batchSize
    +      val endCol = startCol + math.min(batchSize, numCols - startCol)
    +      val pairCounts = data.flatMap { p =>
    +        // assume dense vectors
    +        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    +          (col, feature, p.label)
    +        }
    +      }.countByValue()
    +
    +      if (labels.size == 0) {
    +        // Do this only once for the first column since labels are invariant across features.
    +        labels = pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct
    +      }
    +      val numLabels = labels.size
    +      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
    +        val features = keys.map(_._2).toArray.distinct
    +        val numRows = features.size
    +        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
    +        keys.foreach { case (_, feature, label) =>
    +          val i = features.indexOf(feature)
    --- End diff --
    
    `indexOf` runs in linear time. Shall we change `features` to a feature-to-index map?
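
One way the suggested lookup could be sketched, assuming the distinct feature values fit in memory on the driver (as they already must for the `features` array in the diff). `FeatureIndexSketch` and `indexMap` are hypothetical names, not part of the PR.

```scala
object FeatureIndexSketch {
  // Build a value -> position map once, so that each later lookup is a
  // hash lookup instead of a linear scan with indexOf.
  def indexMap(values: Array[Double]): Map[Double, Int] =
    values.distinct.zipWithIndex.toMap

  def main(args: Array[String]): Unit = {
    val features = Array(0.5, 1.5, 0.5, 2.5)
    val featureIndex = indexMap(features)
    // Positions follow first-occurrence order of the distinct values.
    println(featureIndex(1.5)) // 1
  }
}
```

The same map could be built once for `labels`, since `labels.indexOf(label)` in the diff has the same linear cost.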



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by dorx <gi...@git.apache.org>.

Github user dorx commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16009835
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/stat/HypothesisTestSuite.scala ---
    @@ -0,0 +1,128 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat
    +
    +import org.scalatest.FunSuite
    +
    +import org.apache.spark.mllib.linalg.{DenseVector, Matrices, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.stat.test.ChiSquaredTest
    +import org.apache.spark.mllib.util.LocalSparkContext
    +import org.apache.spark.mllib.util.TestingUtils._
    +
    +class HypothesisTestSuite extends FunSuite with LocalSparkContext {
    +
    +  test("chi squared pearson goodness of fit") {
    +
    +    val observed = new DenseVector(Array[Double](4, 6, 5))
    +    val pearson = Statistics.chiSqTest(observed)
    +
    +    // Results validated against the R command `chisq.test(c(4, 6, 5), p=c(1/3, 1/3, 1/3))`
    +    assert(pearson.statistic === 0.4)
    +    assert(pearson.degreesOfFreedom === 2)
    +    assert(pearson.pValue ~= 0.8187 relTol 1e-4)
    +    assert(pearson.method === ChiSquaredTest.PEARSON.name)
    +    assert(pearson.nullHypothesis === ChiSquaredTest.NullHypothesis.goodnessOfFit.toString)
    +
    +    // different expected and observed sum
    +    val observed1 = new DenseVector(Array[Double](21, 38, 43, 80))
    +    val expected1 = new DenseVector(Array[Double](3, 5, 7, 20))
    +    val pearson1 = Statistics.chiSqTest(observed1, expected1)
    +
    +    // Results validated against the R command
    +    // `chisq.test(c(21, 38, 43, 80), p=c(3/35, 1/7, 1/5, 4/7))`
    +    assert(pearson1.statistic ~= 14.1429 relTol 1e-4)
    +    assert(pearson1.degreesOfFreedom === 3)
    +    assert(pearson1.pValue ~= 0.002717 relTol 1e-4)
    +    assert(pearson1.method === ChiSquaredTest.PEARSON.name)
    +    assert(pearson1.nullHypothesis === ChiSquaredTest.NullHypothesis.goodnessOfFit.toString)
    +
    +    // SparseVector representation to make sure memory doesn't blow up
    --- End diff --
    
    It's actually meant as a note to perf testers, but okay.



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51662596
  
    QA results for PR 1733:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18226/consoleFull



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51298178
  
    @dorx I checked R's implementation and finally figured out what is going on.
    
    1. When only a vector `x` is given, it is treated as a vector containing frequency counts for categories and tested against multinomial distribution.
    2. When a matrix `x` is given, it is treated as a contingency table and the test is for independence. 
    3. When both `x` and `y` are given, both vectors are treated as factors (categorical values) and the test is for independence.
    
    I want to suggest the following APIs:
    
    ~~~
    // test observed frequencies against multinomial distribution with
    // `p = (1/n, 1/n, ..., 1/n)`
    def chiSqTest(counts: Vector)
    
    // test observed frequencies against the given multinomial distribution
    def chiSqTest(counts: Vector, p: Vector)
    
    // test independence using the given contingency table 
    def chiSqTest(counts: Matrix)
    
    // test independence using the given observed pairs (assuming categorical values)
    def chiSqTest[V1, V2](observations: RDD[(V1, V2)])
    ~~~
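
To make the first two proposed signatures concrete, here is a local sketch of the Pearson goodness-of-fit statistic on plain arrays. It is an illustration under assumptions, not the MLlib implementation: `GoodnessOfFitSketch`, `pearsonStatistic`, and `uniform` are hypothetical names, and the p-value computation is omitted.

```scala
object GoodnessOfFitSketch {
  // Pearson statistic for observed counts against a multinomial
  // distribution with probabilities p (assumed to sum to 1).
  def pearsonStatistic(counts: Array[Double], p: Array[Double]): Double = {
    require(counts.length == p.length, "counts and p must have the same length")
    val total = counts.sum
    counts.zip(p).map { case (obs, prob) =>
      val expected = total * prob // expected count for this category
      val dev = obs - expected
      dev * dev / expected
    }.sum
  }

  // p = (1/n, 1/n, ..., 1/n), the default when only counts are given.
  def uniform(n: Int): Array[Double] = Array.fill(n)(1.0 / n)

  def main(args: Array[String]): Unit = {
    val counts = Array(4.0, 6.0, 5.0)
    println(pearsonStatistic(counts, uniform(counts.length)))
  }
}
```

With counts `(4, 6, 5)` and uniform `p`, the expected count is 5 per category, so the statistic is (1 + 1 + 0) / 5 = 0.4, which agrees with the value the PR's test suite validates against R's `chisq.test`.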



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by dorx <gi...@git.apache.org>.

Github user dorx commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16024698
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, returning an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels = Array[Double]()
    +    // At most 100 columns at a time
    +    val batchSize = 100
    +    var batch = 0
    +    while (batch * batchSize < numCols) {
    +      // The following block of code can be cleaned up and made public as
    +      // chiSquared(data: RDD[(V1, V2)])
    +      val startCol = batch * batchSize
    +      val endCol = startCol + math.min(batchSize, numCols - startCol)
    +      val pairCounts = data.flatMap { p =>
    +        // assume dense vectors
    +        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    +          (col, feature, p.label)
    +        }
    +      }.countByValue()
    +
    +      if (labels.size == 0) {
    +        // Do this only once for the first column since labels are invariant across features.
    +        labels = pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct
    +      }
    +      val numLabels = labels.size
    +      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
    +        val features = keys.map(_._2).toArray.distinct
    +        val numRows = features.size
    +        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
    +        keys.foreach { case (_, feature, label) =>
    +          val i = features.indexOf(feature)
    +          val j = labels.indexOf(label)
    +          contingency(i, j) += pairCounts((col, feature, label))
    +        }
    +        results(col) = chiSquaredMatrix(Matrices.fromBreeze(contingency), methodName)
    +      }
    +      batch += 1
    +    }
    +    results
    +  }
    +
    +  /*
    +   * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
    +   * Uniform distribution is assumed when `expected` is not passed in.
    +   */
    +  def chiSquared(observed: Vector,
    +      expected: Vector = Vectors.dense(Array[Double]()),
    +      methodName: String = PEARSON.name): ChiSqTestResult = {
    +
    +    // Validate input arguments
    +    val method = methodFromString(methodName)
    +    if (expected.size != 0 && observed.size != expected.size) {
    +      throw new IllegalArgumentException("observed and expected must be of the same size.")
    +    }
    +    val size = observed.size
    +    // Avoid calling toArray on input vectors to avoid memory blow up
    +    // (esp if size = Int.MaxValue for a SparseVector).
    +    // Check positivity and collect sums
    +    var obsSum = 0.0
    +    var expSum = if (expected.size == 0.0) 1.0 else 0.0
    +    var i = 0
    +    while (i < size) {
    +      val obs = observed(i)
    +      if (obs < 0.0) {
    +        throw new IllegalArgumentException("Values in observed must be nonnegative.")
    +      }
    +      obsSum += obs
    +      if (expected.size > 0) {
    +        val exp = expected(i)
    +        if (exp <= 0.0) {
    +          throw new IllegalArgumentException("Values in expected must be positive.")
    +        }
    +        expSum += exp
    +      }
    +      i += 1
    +    }
    +
    +    // Determine the scaling factor for expected
    +    val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else  obsSum / expSum
    +    val getExpected: (Int) => Double = if (expected.size == 0) {
    --- End diff --
    
    Okay. I'll simplify the logic here and log a warning if `observed.size` is too big, say `> 1000`?



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16024040
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, returning an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels = Array[Double]()
    +    // At most 100 columns at a time
    +    val batchSize = 100
    +    var batch = 0
    +    while (batch * batchSize < numCols) {
    +      // The following block of code can be cleaned up and made public as
    +      // chiSquared(data: RDD[(V1, V2)])
    +      val startCol = batch * batchSize
    +      val endCol = startCol + math.min(batchSize, numCols - startCol)
    +      val pairCounts = data.flatMap { p =>
    +        // assume dense vectors
    +        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    +          (col, feature, p.label)
    +        }
    +      }.countByValue()
    +
    +      if (labels.size == 0) {
    +        // Do this only once for the first column since labels are invariant across features.
    +        labels = pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct
    +      }
    +      val numLabels = labels.size
    +      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
    +        val features = keys.map(_._2).toArray.distinct
    +        val numRows = features.size
    +        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
    +        keys.foreach { case (_, feature, label) =>
    +          val i = features.indexOf(feature)
    +          val j = labels.indexOf(label)
    +          contingency(i, j) += pairCounts((col, feature, label))
    +        }
    +        results(col) = chiSquaredMatrix(Matrices.fromBreeze(contingency), methodName)
    +      }
    +      batch += 1
    +    }
    +    results
    +  }
    +
    +  /*
    +   * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
    +   * Uniform distribution is assumed when `expected` is not passed in.
    +   */
    +  def chiSquared(observed: Vector,
    +      expected: Vector = Vectors.dense(Array[Double]()),
    +      methodName: String = PEARSON.name): ChiSqTestResult = {
    +
    +    // Validate input arguments
    +    val method = methodFromString(methodName)
    +    if (expected.size != 0 && observed.size != expected.size) {
    +      throw new IllegalArgumentException("observed and expected must be of the same size.")
    +    }
    +    val size = observed.size
    +    // Avoid calling toArray on input vectors to avoid memory blow up
    +    // (esp if size = Int.MaxValue for a SparseVector).
    +    // Check positivity and collect sums
    +    var obsSum = 0.0
    +    var expSum = if (expected.size == 0.0) 1.0 else 0.0
    +    var i = 0
    +    while (i < size) {
    +      val obs = observed(i)
    +      if (obs < 0.0) {
    +        throw new IllegalArgumentException("Values in observed must be nonnegative.")
    +      }
    +      obsSum += obs
    +      if (expected.size > 0) {
    +        val exp = expected(i)
    +        if (exp <= 0.0) {
    +          throw new IllegalArgumentException("Values in expected must be positive.")
    +        }
    +        expSum += exp
    +      }
    +      i += 1
    +    }
    +
    +    // Determine the scaling factor for expected
    +    val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else  obsSum / expSum
    +    val getExpected: (Int) => Double = if (expected.size == 0) {
    --- End diff --
    
    This adds complexity to the implementation. As mentioned above, it is not common to have many categories in a chi-square test. We can create the expected vector and use it.
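
A minimal sketch of the simplification suggested here, assuming plain `Array[Double]` inputs rather than MLlib `Vector`s. The object and method names (`ExpectedCountsSketch`, `expectedCounts`, `pearson`) are hypothetical and not part of the patch; unlike the patch, the sketch always rescales instead of skipping the rescale when the sums already agree within a tolerance.

```scala
object ExpectedCountsSketch {

  /**
   * Materializes the expected counts up front instead of using a lookup function.
   * An empty `expected` is treated as "uniform", mirroring the default in the patch.
   * The result is rescaled so that it sums to the observed total.
   */
  def expectedCounts(observed: Array[Double], expected: Array[Double]): Array[Double] = {
    require(expected.isEmpty || expected.length == observed.length,
      "observed and expected must be of the same size.")
    val obsSum = observed.sum
    if (expected.isEmpty) {
      // Uniform distribution: every category gets the same share of the total.
      Array.fill(observed.length)(obsSum / observed.length)
    } else {
      val scale = obsSum / expected.sum
      expected.map(_ * scale)
    }
  }

  /** Pearson statistic: sum over categories of (observed - expected)^2 / expected. */
  def pearson(observed: Array[Double], expected: Array[Double]): Double =
    observed.zip(expected).map { case (o, e) => (o - e) * (o - e) / e }.sum
}
```

With the number of categories small, the extra array is cheap, and the statistic becomes a single pass over two arrays of equal length.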



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by dorx <gi...@git.apache.org>.

Github user dorx commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51286506
  
    @mengxr @ jkbradley @falaki 
    PR ready for review now.



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1733



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51287008
  
    remove space between `@` and `jkbradley` ~ :)



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51657309
  
    QA tests have started for PR 1733. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18226/consoleFull



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by dorx <gi...@git.apache.org>.

Github user dorx commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16024886
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, return an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels = Array[Double]()
    +    // At most 100 columns at a time
    +    val batchSize = 100
    +    var batch = 0
    +    while (batch * batchSize < numCols) {
    +      // The following block of code can be cleaned up and made public as
    +      // chiSquared(data: RDD[(V1, V2)])
    +      val startCol = batch * batchSize
    +      val endCol = startCol + math.min(batchSize, numCols - startCol)
    +      val pairCounts = data.flatMap { p =>
    +        // assume dense vectors
    +        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    +          (col, feature, p.label)
    +        }
    +      }.countByValue()
    +
    +      if (labels.size == 0) {
    +        // Do this only once for the first column since labels are invariant across features.
    +        labels = pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct
    +      }
    +      val numLabels = labels.size
    +      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
    +        val features = keys.map(_._2).toArray.distinct
    +        val numRows = features.size
    +        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
    +        keys.foreach { case (_, feature, label) =>
    +          val i = features.indexOf(feature)
    +          val j = labels.indexOf(label)
    +          contingency(i, j) += pairCounts((col, feature, label))
    +        }
    +        results(col) = chiSquaredMatrix(Matrices.fromBreeze(contingency), methodName)
    +      }
    +      batch += 1
    +    }
    +    results
    +  }
    +
    +  /*
    +   * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
    +   * Uniform distribution is assumed when `expected` is not passed in.
    +   */
    +  def chiSquared(observed: Vector,
    +      expected: Vector = Vectors.dense(Array[Double]()),
    +      methodName: String = PEARSON.name): ChiSqTestResult = {
    +
    +    // Validate input arguments
    +    val method = methodFromString(methodName)
    +    if (expected.size != 0 && observed.size != expected.size) {
    +      throw new IllegalArgumentException("observed and expected must be of the same size.")
    +    }
    +    val size = observed.size
    +    // Avoid calling toArray on input vectors to avoid memory blow up
    +    // (esp if size = Int.MaxValue for a SparseVector).
    +    // Check positivity and collect sums
    +    var obsSum = 0.0
    +    var expSum = if (expected.size == 0.0) 1.0 else 0.0
    +    var i = 0
    +    while (i < size) {
    +      val obs = observed(i)
    +      if (obs < 0.0) {
    +        throw new IllegalArgumentException("Values in observed must be nonnegative.")
    +      }
    +      obsSum += obs
    +      if (expected.size > 0) {
    +        val exp = expected(i)
    +        if (exp <= 0.0) {
    --- End diff --
    
    Right. But do we want to return a result with statistic and p-value = NaN or do we want to throw an exception in that case? My question was more around whether we want to have consistent behaviors for both cases.
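
To make the two options concrete, here is a small sketch, assuming "that case" refers to a zero expected count. It reuses the Pearson term from the patch; `ZeroExpectedSketch` and `checkedTerm` are hypothetical names for illustration only.

```scala
object ZeroExpectedSketch {

  /** The Pearson term as written in the patch: (observed - expected)^2 / expected. */
  def pearsonTerm(observed: Double, expected: Double): Double = {
    val dev = observed - expected
    dev * dev / expected
  }

  /** Option A: fail fast, as the goodness-of-fit validation in the patch does. */
  def checkedTerm(observed: Double, expected: Double): Double = {
    if (expected <= 0.0) {
      throw new IllegalArgumentException("Values in expected must be positive.")
    }
    pearsonTerm(observed, expected)
  }
}
```

Option B is to call `pearsonTerm` without the check: under IEEE 754 arithmetic a zero expected count then yields NaN when the observed count is also zero (0.0 / 0.0) and positive infinity otherwise, and either value propagates into the summed statistic.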



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15854417
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
    @@ -89,4 +90,76 @@ object Statistics {
        */
       @Experimental
       def corr(x: RDD[Double], y: RDD[Double], method: String): Double = Correlations.corr(x, y, method)
    +
    +  /**
    +   * :: Experimental ::
    +   * Conduct the Chi-squared goodness of fit test of the observed data against the
    +   * expected distribution.
    +   *
    +   * Note: the two input RDDs need to have the same number of partitions and the same number of
    +   * elements in each partition.
    +   *
    +   * @param observed RDD[Double] containing the observed counts.
    +   * @param expected RDD[Double] containing the expected counts. If the observed total differs from
    +   *                 the expected total, this RDD is rescaled to sum up to the observed total.
    +   * @param method String specifying the method to use for the Chi-squared test.
    +   *               Supported: `pearson` (default)
    +   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
    +   *         the method used, and the null hypothesis.
    +   */
    +  @Experimental
    +  def chiSquared(observed: RDD[Double],
    +      expected: RDD[Double],
    +      method: String): ChiSquaredTestResult = {
    +    ChiSquaredTest.chiSquared(observed, expected, method)
    +  }
    +
    +  /**
    +   * :: Experimental ::
    +   * Conduct the Chi-squared goodness of fit test of the observed data against the
    +   * expected distribution.
    --- End diff --
    
    mention `pearson` here?
    
    minor: I think it should be fine to remove the rest of the doc and point users to the method with the full set of parameters, so we only maintain one copy.



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16024030
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, return an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels = Array[Double]()
    +    // At most 100 columns at a time
    +    val batchSize = 100
    +    var batch = 0
    +    while (batch * batchSize < numCols) {
    +      // The following block of code can be cleaned up and made public as
    +      // chiSquared(data: RDD[(V1, V2)])
    +      val startCol = batch * batchSize
    +      val endCol = startCol + math.min(batchSize, numCols - startCol)
    +      val pairCounts = data.flatMap { p =>
    +        // assume dense vectors
    +        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    +          (col, feature, p.label)
    +        }
    +      }.countByValue()
    +
    +      if (labels.size == 0) {
    +        // Do this only once for the first column since labels are invariant across features.
    +        labels = pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct
    +      }
    +      val numLabels = labels.size
    +      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
    +        val features = keys.map(_._2).toArray.distinct
    +        val numRows = features.size
    +        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
    +        keys.foreach { case (_, feature, label) =>
    +          val i = features.indexOf(feature)
    +          val j = labels.indexOf(label)
    --- End diff --
    
    ditto: use a map instead of linear lookup
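
One way to read this suggestion, sketched on plain Scala collections: build a value-to-index map once per table, so each cell lookup is a hash lookup rather than an `indexOf` scan. `IndexLookupSketch` and `buildTable` are hypothetical names, and the nested-array table stands in for the Breeze matrix used in the patch.

```scala
object IndexLookupSketch {

  /**
   * Fills a contingency table (rows = feature values, columns = labels) from
   * per-pair counts, resolving row and column positions through index maps.
   */
  def buildTable(
      features: Array[Double],
      labels: Array[Double],
      pairCounts: Map[(Double, Double), Long]): Array[Array[Double]] = {
    // Built once; replaces features.indexOf(...) and labels.indexOf(...).
    val featureIndex: Map[Double, Int] = features.zipWithIndex.toMap
    val labelIndex: Map[Double, Int] = labels.zipWithIndex.toMap
    val table = Array.ofDim[Double](features.length, labels.length)
    pairCounts.foreach { case ((feature, label), count) =>
      table(featureIndex(feature))(labelIndex(label)) += count
    }
    table
  }
}
```

The difference only matters when a feature takes many distinct values, but the maps also make the intent of the lookup explicit.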



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16024027
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, return an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels = Array[Double]()
    --- End diff --
    
    could be initialized as a null



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51574118
  
    Verified test results with R and all good :)



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51347255
  
    The previous proposal may be hard to implement in Python. Another solution would be to separate the goodness-of-fit test from the independence test, e.g., `chiSqGofTest` and `chiSqIndTest`.
    
    ~~~
    def chiSqGofTest(counts: Vector)
    
    def chiSqGofTest(counts: Vector, p: Vector)
    
    def chiSqIndTest(counts: Matrix)
    
    def chiSqIndTest[V1, V2](observations: RDD[(V1, V2)])
    ~~~
    
    We can also add direct RDD support, which may be unnecessary:
    
    ~~~
    def chiSqGofTest[V](observations: RDD[V], p: Map[V, Double])
    ~~~
    
    Since we only support `pearson`, we can hide `method` in the public API for now.
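
As a rough illustration of the split proposed above, here is a minimal sketch in plain Scala. It is not the patch's implementation: `Array[Double]` and `Array[Array[Double]]` stand in for mllib's `Vector` and `Matrix`, the `Stat` result type and the `ChiSqSketch` object are hypothetical, and the p-value is left out because it needs a chi-squared CDF (the patch imports `chiSquareComplemented` for that).

```scala
// Hypothetical sketch: Array[Double] stands in for mllib's Vector and
// Array[Array[Double]] for Matrix. Only the Pearson statistic and the
// degrees of freedom are computed; the p-value is omitted.
object ChiSqSketch {
  final case class Stat(statistic: Double, degreesOfFreedom: Int)

  // Goodness of fit against a uniform distribution.
  def chiSqGofTest(counts: Array[Double]): Stat =
    chiSqGofTest(counts, Array.fill(counts.length)(1.0 / counts.length))

  // Goodness of fit against expected proportions `p`,
  // rescaled so that they sum to the observed total.
  def chiSqGofTest(counts: Array[Double], p: Array[Double]): Stat = {
    require(counts.length == p.length, "counts and p must have the same size")
    val scale = counts.sum / p.sum
    val stat = counts.zip(p).map { case (o, e) =>
      val expected = e * scale
      val dev = o - expected
      dev * dev / expected
    }.sum
    Stat(stat, counts.length - 1)
  }

  // Independence test on a contingency table given as an array of rows.
  def chiSqIndTest(counts: Array[Array[Double]]): Stat = {
    val rowSums = counts.map(_.sum)
    val colSums = counts.transpose.map(_.sum)
    val total = rowSums.sum
    val stat = (for {
      i <- counts.indices
      j <- counts(i).indices
    } yield {
      // Expected count under independence of rows and columns.
      val expected = rowSums(i) * colSums(j) / total
      val dev = counts(i)(j) - expected
      dev * dev / expected
    }).sum
    Stat(stat, (rowSums.length - 1) * (colSums.length - 1))
  }
}
```

Keeping the two tests under different names avoids overloading on input type alone, which is presumably what makes this variant easier to mirror in Python.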



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16026262
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, returning an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels = Array[Double]()
    +    // At most 100 columns at a time
    +    val batchSize = 100
    +    var batch = 0
    +    while (batch * batchSize < numCols) {
    +      // The following block of code can be cleaned up and made public as
    +      // chiSquared(data: RDD[(V1, V2)])
    +      val startCol = batch * batchSize
    +      val endCol = startCol + math.min(batchSize, numCols - startCol)
    +      val pairCounts = data.flatMap { p =>
    +        // assume dense vectors
    +        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    +          (col, feature, p.label)
    +        }
    +      }.countByValue()
    +
    +      if (labels.size == 0) {
    +        // Do this only once for the first column since labels are invariant across features.
    +        labels = pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct
    +      }
    +      val numLabels = labels.size
    +      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
    +        val features = keys.map(_._2).toArray.distinct
    +        val numRows = features.size
    +        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
    +        keys.foreach { case (_, feature, label) =>
    +          val i = features.indexOf(feature)
    +          val j = labels.indexOf(label)
    +          contingency(i, j) += pairCounts((col, feature, label))
    +        }
    +        results(col) = chiSquaredMatrix(Matrices.fromBreeze(contingency), methodName)
    +      }
    +      batch += 1
    +    }
    +    results
    +  }
    +
    +  /*
    +   * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
    +   * Uniform distribution is assumed when `expected` is not passed in.
    +   */
    +  def chiSquared(observed: Vector,
    +      expected: Vector = Vectors.dense(Array[Double]()),
    +      methodName: String = PEARSON.name): ChiSqTestResult = {
    +
    +    // Validate input arguments
    +    val method = methodFromString(methodName)
    +    if (expected.size != 0 && observed.size != expected.size) {
    +      throw new IllegalArgumentException("observed and expected must be of the same size.")
    +    }
    +    val size = observed.size
    +    // Avoid calling toArray on input vectors to avoid memory blow up
    +    // (esp if size = Int.MaxValue for a SparseVector).
    +    // Check positivity and collect sums
    +    var obsSum = 0.0
    +    var expSum = if (expected.size == 0) 1.0 else 0.0
    +    var i = 0
    +    while (i < size) {
    +      val obs = observed(i)
    +      if (obs < 0.0) {
    +        throw new IllegalArgumentException("Values in observed must be nonnegative.")
    +      }
    +      obsSum += obs
    +      if (expected.size > 0) {
    +        val exp = expected(i)
    +        if (exp <= 0.0) {
    --- End diff --
    
    The first case is a valid test so we should return a result with `pValue = 0`. But the second case is not valid because 0/0 is undefined. I prefer throwing an exception in the second case.
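
The two cases are not spelled out in this message. Assuming they are (a) a positive observed value with an expected value of 0 and (b) both values equal to 0, the handling described above could look roughly like the following. The `pearsonTerm` helper and the `ZeroExpectedSketch` object are hypothetical and not part of the patch.

```scala
// Hypothetical helper, not part of the patch: one cell's contribution to
// Pearson's statistic under the assumed reading of the two cases.
object ZeroExpectedSketch {
  def pearsonTerm(observed: Double, expected: Double): Double = {
    require(observed >= 0.0, "observed must be nonnegative")
    require(expected >= 0.0, "expected must be nonnegative")
    if (expected == 0.0) {
      if (observed == 0.0) {
        // 0/0 is undefined, so the input is rejected.
        throw new IllegalArgumentException("observed and expected are both zero")
      }
      // A positive count where none is expected: the term is infinite,
      // which corresponds to a p-value of 0 for the whole test.
      Double.PositiveInfinity
    } else {
      val dev = observed - expected
      dev * dev / expected
    }
  }
}
```

Summing such terms gives an infinite statistic as soon as one cell falls under case (a), so the caller would report `pValue = 0` without any special casing.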



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15981438
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
    @@ -89,4 +91,64 @@ object Statistics {
        */
       @Experimental
       def corr(x: RDD[Double], y: RDD[Double], method: String): Double = Correlations.corr(x, y, method)
    +
    +  /**
    +   * :: Experimental ::
    +   * Conduct Pearson's chi-squared goodness of fit test of the observed data against the
    +   * expected distribution.
    +   *
    +   * Note: the two input Vectors need to have the same size.
    +   *       `observed` cannot contain negative values.
    +   *       `expected` cannot contain nonpositive values.
    +   *
    +   * @param observed Vector containing the observed categorical counts/relative frequencies.
    +   * @param expected Vector containing the expected categorical counts/relative frequencies.
    +   *                 `expected` is rescaled if the `expected` sum differs from the `observed` sum.
    +   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
    +   *         the method used, and the null hypothesis.
    +   */
    +  @Experimental
    +  def chiSqTest(observed: Vector,
    +      expected: Vector): ChiSquaredTestResult = ChiSquaredTest.chiSquared(observed, expected)
    +
    +  /**
    +   * :: Experimental ::
    +   * Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform
    +   * distribution, with each category having an expected frequency of `1 / observed.size`.
    +   *
    +   * Note: `observed` cannot contain negative values.
    +   *
    +   * @param observed Vector containing the observed categorical counts/relative frequencies.
    +   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
    +   *         the method used, and the null hypothesis.
    +   */
    +  @Experimental
    +  def chiSqTest(observed: Vector): ChiSquaredTestResult = ChiSquaredTest.chiSquared(observed)
    +
    +  /**
    +   * :: Experimental ::
    +   * Conduct Pearson's independence test on the input contingency matrix, which cannot contain
    +   * negative entries or columns or rows that sum up to 0.
    +   *
    +   * @param counts The contingency matrix.
    +   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
    +   *         the method used, and the null hypothesis.
    +   */
    +  @Experimental
    +  def chiSqTest(counts: Matrix): ChiSquaredTestResult = ChiSquaredTest.chiSquaredMatrix(counts)
    --- End diff --
    
    `counts` -> `observed`? This table could also be probabilities.



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15981441
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
    @@ -89,4 +91,64 @@ object Statistics {
        */
       @Experimental
       def corr(x: RDD[Double], y: RDD[Double], method: String): Double = Correlations.corr(x, y, method)
    +
    +  /**
    +   * :: Experimental ::
    +   * Conduct Pearson's chi-squared goodness of fit test of the observed data against the
    +   * expected distribution.
    +   *
    +   * Note: the two input Vectors need to have the same size.
    +   *       `observed` cannot contain negative values.
    +   *       `expected` cannot contain nonpositive values.
    +   *
    +   * @param observed Vector containing the observed categorical counts/relative frequencies.
    +   * @param expected Vector containing the expected categorical counts/relative frequencies.
    +   *                 `expected` is rescaled if the `expected` sum differs from the `observed` sum.
    +   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
    +   *         the method used, and the null hypothesis.
    +   */
    +  @Experimental
    +  def chiSqTest(observed: Vector,
    +      expected: Vector): ChiSquaredTestResult = ChiSquaredTest.chiSquared(observed, expected)
    +
    +  /**
    +   * :: Experimental ::
    +   * Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform
    +   * distribution, with each category having an expected frequency of `1 / observed.size`.
    +   *
    +   * Note: `observed` cannot contain negative values.
    +   *
    +   * @param observed Vector containing the observed categorical counts/relative frequencies.
    +   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
    +   *         the method used, and the null hypothesis.
    +   */
    +  @Experimental
    +  def chiSqTest(observed: Vector): ChiSquaredTestResult = ChiSquaredTest.chiSquared(observed)
    +
    +  /**
    +   * :: Experimental ::
    +   * Conduct Pearson's independence test on the input contingency matrix, which cannot contain
    +   * negative entries or columns or rows that sum up to 0.
    +   *
    +   * @param counts The contingency matrix.
    +   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
    +   *         the method used, and the null hypothesis.
    +   */
    +  @Experimental
    +  def chiSqTest(counts: Matrix): ChiSquaredTestResult = ChiSquaredTest.chiSquaredMatrix(counts)
    +
    +  /**
    +   * :: Experimental ::
    +   * Conduct Pearson's independence test for every feature against the label across the input RDD.
    +   * For each feature, the (feature, label) pairs are converted into a contingency matrix for which
    +   * the chi-squared statistic is computed.
    +   *
    +   * @param data an `RDD[LabeledPoint]` containing the Labeled dataset.
    --- End diff --
    
    mention categorical here?



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15854426
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
    @@ -89,4 +90,76 @@ object Statistics {
        */
       @Experimental
       def corr(x: RDD[Double], y: RDD[Double], method: String): Double = Correlations.corr(x, y, method)
    +
    +  /**
    +   * :: Experimental ::
    +   * Conduct the Chi-squared goodness of fit test of the observed data against the
    --- End diff --
    
    `Chi` -> `chi`



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51857628
  
    QA results for PR 1733:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18340/consoleFull



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15981500
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
    @@ -0,0 +1,88 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import org.apache.spark.annotation.Experimental
    +
    +/**
    + * :: Experimental ::
    + * Trait for hypothesis test results.
    + * @tparam DF Return type of `degreesOfFreedom`
    + */
    +@Experimental
    +trait TestResult[DF] {
    +
    +  /**
    +   *
    +   */
    +  def pValue: Double
    +
    +  /**
    +   *
    +   * @return
    +   */
    +  def degreesOfFreedom: DF
    +
    +  /**
    +   *
    +   * @return
    +   */
    +  def statistic: Double
    +
    +  /**
    +   * String explaining the hypothesis test result.
    +   * Specific classes implementing this trait should override this method to output test-specific
    +   * information.
    +   */
    +  override def toString: String = {
    +
    +    // String explaining what the p-value indicates.
    +    val pValueExplain = if (pValue <= 0.01) {
    +      "Very strong presumption against null hypothesis."
    +    } else if (0.01 < pValue && pValue <= 0.05) {
    +      "Strong presumption against null hypothesis."
    +    } else if (0.05 < pValue && pValue <= 0.1) {
    +      "Low presumption against null hypothesis."
    +    } else {
    +      "No presumption against null hypothesis."
    +    }
    +
    +    s"degrees of freedom = ${degreesOfFreedom.toString} \n" +
    +    s"statistic = $statistic \n" +
    +    s"pValue = $pValue \n" + pValueExplain
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Object containing the test results for the chi squared hypothesis test.
    + */
    +@Experimental
    +case class ChiSquaredTestResult(override val pValue: Double,
    --- End diff --
    
    Does it need to be a case class? The Scala compiler adds many methods to a case class, which makes it very hard to extend.
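
To illustrate the concern, here is a minimal sketch of the result type as a plain class. The trait is a simplified copy of the one in the diff, and the `ResultSketch` wrapper and the `LabeledChiSqTestResult` subclass are hypothetical names introduced only for this example.

```scala
// Hypothetical sketch: a simplified trait and a plain (non-case) result class.
object ResultSketch {
  trait TestResult[DF] {
    def pValue: Double
    def degreesOfFreedom: DF
    def statistic: Double
  }

  // Constructor vals implement the trait members. No copy, unapply,
  // equals, or hashCode are generated for a plain class.
  class ChiSqTestResult(
      override val pValue: Double,
      override val degreesOfFreedom: Int,
      override val statistic: Double,
      val method: String,
      val nullHypothesis: String) extends TestResult[Int] {
    override def toString: String =
      s"method = $method, degrees of freedom = $degreesOfFreedom, " +
      s"statistic = $statistic, pValue = $pValue"
  }

  // A subclass can add a field without having to reconcile generated methods.
  class LabeledChiSqTestResult(
      val feature: Int,
      p: Double,
      df: Int,
      stat: Double,
      method: String,
      nullHypothesis: String)
    extends ChiSqTestResult(p, df, stat, method, nullHypothesis)
}
```

With a case class, the generated `copy`, `equals`, and `unapply` would only know about the base fields, and Scala does not allow one case class to extend another, so a plain class is usually the safer choice for a type meant to be extended.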



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51286427
  
    QA tests have started for PR 1733. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17974/consoleFull



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51289600
  
    QA results for PR 1733:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17975/consoleFull



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16024043
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, returning an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels = Array[Double]()
    +    // At most 100 columns at a time
    +    val batchSize = 100
    +    var batch = 0
    +    while (batch * batchSize < numCols) {
    +      // The following block of code can be cleaned up and made public as
    +      // chiSquared(data: RDD[(V1, V2)])
    +      val startCol = batch * batchSize
    +      val endCol = startCol + math.min(batchSize, numCols - startCol)
    +      val pairCounts = data.flatMap { p =>
    +        // assume dense vectors
    +        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    +          (col, feature, p.label)
    +        }
    +      }.countByValue()
    +
    +      if (labels.size == 0) {
    +        // Do this only once for the first column since labels are invariant across features.
    +        labels = pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct
    +      }
    +      val numLabels = labels.size
    +      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
    +        val features = keys.map(_._2).toArray.distinct
    +        val numRows = features.size
    +        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
    +        keys.foreach { case (_, feature, label) =>
    +          val i = features.indexOf(feature)
    +          val j = labels.indexOf(label)
    +          contingency(i, j) += pairCounts((col, feature, label))
    +        }
    +        results(col) = chiSquaredMatrix(Matrices.fromBreeze(contingency), methodName)
    +      }
    +      batch += 1
    +    }
    +    results
    +  }
    +
    +  /*
    +   * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
    +   * Uniform distribution is assumed when `expected` is not passed in.
    +   */
    +  def chiSquared(observed: Vector,
    +      expected: Vector = Vectors.dense(Array[Double]()),
    +      methodName: String = PEARSON.name): ChiSqTestResult = {
    +
    +    // Validate input arguments
    +    val method = methodFromString(methodName)
    +    if (expected.size != 0 && observed.size != expected.size) {
    +      throw new IllegalArgumentException("observed and expected must be of the same size.")
    +    }
    +    val size = observed.size
    +    // Avoid calling toArray on input vectors to avoid memory blow up
    +    // (esp if size = Int.MaxValue for a SparseVector).
    +    // Check positivity and collect sums
    +    var obsSum = 0.0
    +    var expSum = if (expected.size == 0) 1.0 else 0.0
    +    var i = 0
    +    while (i < size) {
    +      val obs = observed(i)
    +      if (obs < 0.0) {
    +        throw new IllegalArgumentException("Values in observed must be nonnegative.")
    +      }
    +      obsSum += obs
    +      if (expected.size > 0) {
    +        val exp = expected(i)
    +        if (exp <= 0.0) {
    +          throw new IllegalArgumentException("Values in expected must be positive.")
    +        }
    +        expSum += exp
    +      }
    +      i += 1
    +    }
    +
    +    // Determine the scaling factor for expected
    +    val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else obsSum / expSum
    +    val getExpected: (Int) => Double = if (expected.size == 0) {
    +      // Assume uniform distribution
    +      if (scale == 1.0) _ => 1.0 / size else _ => scale / size
    +    } else {
    +      if (scale == 1.0) (i: Int) => expected(i) else (i: Int) => scale * expected(i)
    +    }
    +
    +    // compute chi-squared statistic
    +    var statistic = 0.0
    +    var j = 0
    +    while (j < observed.size) {
    +      val obs = observed(j)
    +      if (obs != 0.0) {
    +        statistic += method.chiSqFunc(obs, getExpected(j))
    +      }
    +      j += 1
    +    }
    +    val df = size - 1
    +    val pValue = chiSquareComplemented(df, statistic)
    +    new ChiSqTestResult(pValue, df, statistic, PEARSON.name, NullHypothesis.goodnessOfFit.toString)
    +  }
    +
    +  /*
    +   * Pearson's independence test on the input contingency matrix.
    +   * TODO: optimize for SparseMatrix when it becomes supported.
    +   */
    +  def chiSquaredMatrix(counts: Matrix, methodName:String = PEARSON.name): ChiSqTestResult = {
    +    val method = methodFromString(methodName)
    +    val numRows = counts.numRows
    +    val numCols = counts.numCols
    +
    +    // get row and column sums
    +    val colSums = new Array[Double](numCols)
    +    val rowSums = new Array[Double](numRows)
    +    val colMajorArr = counts.toArray
    +    var i = 0
    +    while (i < colMajorArr.size) {
    +      val elem = colMajorArr(i)
    +      if (elem < 0.0) {
    +        throw new IllegalArgumentException("Contingency table cannot contain negative entries.")
    +      }
    +      colSums(i / numRows) += elem
    +      rowSums(i % numRows) += elem
    +      i += 1
    +    }
    +    if (!colSums.forall(_ > 0.0) || !rowSums.forall(_ > 0.0)) {
    +      throw new IllegalArgumentException("Chi square statistic cannot be computed for input matrix "
    --- End diff --
    
    It would be nice to include the offending row or column index in this error message.
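
A minimal sketch of what that could look like, using plain arrays for the row and column sums. The names `ZeroSumCheck`, `nonPositiveIndices` and `requirePositiveSums`, as well as the message wording, are hypothetical and not part of the patch.

```scala
object ZeroSumCheck {
  // Indices of all sums that are not strictly positive.
  def nonPositiveIndices(sums: Array[Double]): Seq[Int] =
    sums.indices.filter(i => sums(i) <= 0.0)

  // Hypothetical helper: fails with a message that names the offending indices.
  def requirePositiveSums(rowSums: Array[Double], colSums: Array[Double]): Unit = {
    val badRows = nonPositiveIndices(rowSums)
    val badCols = nonPositiveIndices(colSums)
    if (badRows.nonEmpty || badCols.nonEmpty) {
      throw new IllegalArgumentException(
        "Chi-squared statistic cannot be computed: zero-sum rows at indices " +
          badRows.mkString("[", ", ", "]") + ", zero-sum columns at indices " +
          badCols.mkString("[", ", ", "]") + ".")
    }
  }
}
```

Collecting all offending indices, rather than stopping at the first, lets the caller fix the input in one go.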



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16024829
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, returning an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels = Array[Double]()
    +    // At most 100 columns at a time
    +    val batchSize = 100
    +    var batch = 0
    +    while (batch * batchSize < numCols) {
    +      // The following block of code can be cleaned up and made public as
    +      // chiSquared(data: RDD[(V1, V2)])
    +      val startCol = batch * batchSize
    +      val endCol = startCol + math.min(batchSize, numCols - startCol)
    +      val pairCounts = data.flatMap { p =>
    +        // assume dense vectors
    +        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    +          (col, feature, p.label)
    +        }
    +      }.countByValue()
    +
    +      if (labels.size == 0) {
    +        // Do this only once for the first column since labels are invariant across features.
    +        labels = pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct
    +      }
    +      val numLabels = labels.size
    +      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
    +        val features = keys.map(_._2).toArray.distinct
    +        val numRows = features.size
    +        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
    +        keys.foreach { case (_, feature, label) =>
    +          val i = features.indexOf(feature)
    +          val j = labels.indexOf(label)
    +          contingency(i, j) += pairCounts((col, feature, label))
    +        }
    +        results(col) = chiSquaredMatrix(Matrices.fromBreeze(contingency), methodName)
    +      }
    +      batch += 1
    +    }
    +    results
    +  }
    +
    +  /*
    +   * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
    +   * Uniform distribution is assumed when `expected` is not passed in.
    +   */
    +  def chiSquared(observed: Vector,
    +      expected: Vector = Vectors.dense(Array[Double]()),
    +      methodName: String = PEARSON.name): ChiSqTestResult = {
    +
    +    // Validate input arguments
    +    val method = methodFromString(methodName)
    +    if (expected.size != 0 && observed.size != expected.size) {
    +      throw new IllegalArgumentException("observed and expected must be of the same size.")
    +    }
    +    val size = observed.size
    +    // Avoid calling toArray on input vectors to avoid memory blow up
    +    // (esp if size = Int.MaxValue for a SparseVector).
    +    // Check positivity and collect sums
    +    var obsSum = 0.0
    +    var expSum = if (expected.size == 0.0) 1.0 else 0.0
    +    var i = 0
    +    while (i < size) {
    +      val obs = observed(i)
    +      if (obs < 0.0) {
    +        throw new IllegalArgumentException("Values in observed must be nonnegative.")
    +      }
    +      obsSum += obs
    +      if (expected.size > 0) {
    +        val exp = expected(i)
    +        if (exp <= 0.0) {
    +          throw new IllegalArgumentException("Values in expected must be positive.")
    +        }
    +        expSum += exp
    +      }
    +      i += 1
    +    }
    +
    +    // Determine the scaling factor for expected
    +    val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else  obsSum / expSum
    +    val getExpected: (Int) => Double = if (expected.size == 0) {
    --- End diff --
    
    sounds good.



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51651306
  
    QA tests have started for PR 1733. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18217/consoleFull



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16024877
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, returning an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels = Array[Double]()
    +    // At most 100 columns at a time
    +    val batchSize = 100
    +    var batch = 0
    +    while (batch * batchSize < numCols) {
    +      // The following block of code can be cleaned up and made public as
    +      // chiSquared(data: RDD[(V1, V2)])
    +      val startCol = batch * batchSize
    +      val endCol = startCol + math.min(batchSize, numCols - startCol)
    +      val pairCounts = data.flatMap { p =>
    +        // assume dense vectors
    +        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    +          (col, feature, p.label)
    +        }
    +      }.countByValue()
    +
    +      if (labels.size == 0) {
    +        // Do this only once for the first column since labels are invariant across features.
    +        labels = pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct
    +      }
    +      val numLabels = labels.size
    +      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
    +        val features = keys.map(_._2).toArray.distinct
    +        val numRows = features.size
    +        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
    +        keys.foreach { case (_, feature, label) =>
    +          val i = features.indexOf(feature)
    +          val j = labels.indexOf(label)
    +          contingency(i, j) += pairCounts((col, feature, label))
    +        }
    +        results(col) = chiSquaredMatrix(Matrices.fromBreeze(contingency), methodName)
    +      }
    +      batch += 1
    +    }
    +    results
    +  }
    +
    +  /*
    +   * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
    +   * Uniform distribution is assumed when `expected` is not passed in.
    +   */
    +  def chiSquared(observed: Vector,
    +      expected: Vector = Vectors.dense(Array[Double]()),
    +      methodName: String = PEARSON.name): ChiSqTestResult = {
    +
    +    // Validate input arguments
    +    val method = methodFromString(methodName)
    +    if (expected.size != 0 && observed.size != expected.size) {
    +      throw new IllegalArgumentException("observed and expected must be of the same size.")
    +    }
    +    val size = observed.size
    +    // Avoid calling toArray on input vectors to avoid memory blow up
    +    // (esp if size = Int.MaxValue for a SparseVector).
    +    // Check positivity and collect sums
    +    var obsSum = 0.0
    +    var expSum = if (expected.size == 0.0) 1.0 else 0.0
    +    var i = 0
    +    while (i < size) {
    +      val obs = observed(i)
    +      if (obs < 0.0) {
    +        throw new IllegalArgumentException("Values in observed must be nonnegative.")
    +      }
    +      obsSum += obs
    +      if (expected.size > 0) {
    +        val exp = expected(i)
    +        if (exp <= 0.0) {
    --- End diff --
    
    It is undefined (because of 0/0) if both are zero. But if observed > 0 and expected = 0, the term (observed - expected)^2 / expected -> Inf and the p-value should be 0.
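
The two cases can be sketched on a single category as follows. This is an illustration under assumptions, not the patch's code: `PearsonTerm.term` is a hypothetical name, and rejecting the 0/0 case with an exception is one possible policy among several.

```scala
object PearsonTerm {
  // Pearson's contribution for one category: (observed - expected)^2 / expected.
  // Assumption for this sketch: the 0/0 case is rejected instead of yielding NaN.
  def term(observed: Double, expected: Double): Double = {
    if (observed == 0.0 && expected == 0.0) {
      throw new IllegalArgumentException(
        "observed and expected are both zero: the term is undefined (0/0).")
    }
    val dev = observed - expected
    // With expected == 0.0 and observed > 0.0 this is positive / 0.0,
    // which evaluates to Double.PositiveInfinity under IEEE 754.
    dev * dev / expected
  }
}
```

With an infinite statistic the p-value would be reported as 0, as the comment suggests. Whether the p-value routine used in the patch accepts an infinite argument is not checked here.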



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r15981448
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,211 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the Chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, returning an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSquaredTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSquaredTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSquaredTestResult](numCols)
    +    var labels = Array[Double]()
    +    var col = 0
    +    while (col < numCols) {
    --- End diff --
    
    This could be done in a single pass (or in batches if numCols is large):
    
    ~~~
    data.flatMap { p =>
      // assume dense vectors
      p.features.toArray.view.zipWithIndex.map { case (f, j) =>
        (j, p.label, f)
      }
    }.countByValue()
    ~~~
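
The same counting idea can be tried without a cluster. The sketch below uses plain Scala collections as a stand-in for the RDD: `Point` is a hypothetical substitute for `LabeledPoint`, and `groupBy` plus a size count plays the role of `countByValue()`.

```scala
object SinglePassCounts {
  // Hypothetical stand-in for LabeledPoint; dense features are assumed.
  final case class Point(label: Double, features: Array[Double])

  // Count every (column, label, feature value) triple in one traversal of the data.
  def tripleCounts(data: Seq[Point]): Map[(Int, Double, Double), Long] =
    data
      .flatMap { p =>
        p.features.toSeq.zipWithIndex.map { case (f, j) => (j, p.label, f) }
      }
      .groupBy(identity)
      .map { case (key, occurrences) => (key, occurrences.size.toLong) }
}
```

Grouping the resulting keys by column index then gives the cells of one contingency table per feature.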



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51286717
  
    QA tests have started for PR 1733. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17975/consoleFull



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16024031
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, returning an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels = Array[Double]()
    +    // At most 100 columns at a time
    +    val batchSize = 100
    +    var batch = 0
    +    while (batch * batchSize < numCols) {
    +      // The following block of code can be cleaned up and made public as
    +      // chiSquared(data: RDD[(V1, V2)])
    +      val startCol = batch * batchSize
    +      val endCol = startCol + math.min(batchSize, numCols - startCol)
    +      val pairCounts = data.flatMap { p =>
    +        // assume dense vectors
    +        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    +          (col, feature, p.label)
    +        }
    +      }.countByValue()
    +
    +      if (labels.size == 0) {
    +        // Do this only once for the first column since labels are invariant across features.
    +        labels = pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct
    +      }
    +      val numLabels = labels.size
    +      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
    +        val features = keys.map(_._2).toArray.distinct
    +        val numRows = features.size
    +        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
    +        keys.foreach { case (_, feature, label) =>
    +          val i = features.indexOf(feature)
    +          val j = labels.indexOf(label)
    +          contingency(i, j) += pairCounts((col, feature, label))
    +        }
    +        results(col) = chiSquaredMatrix(Matrices.fromBreeze(contingency), methodName)
    +      }
    +      batch += 1
    +    }
    +    results
    +  }
    +
    +  /*
    +   * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
    +   * Uniform distribution is assumed when `expected` is not passed in.
    +   */
    +  def chiSquared(observed: Vector,
    +      expected: Vector = Vectors.dense(Array[Double]()),
    +      methodName: String = PEARSON.name): ChiSqTestResult = {
    +
    +    // Validate input arguments
    +    val method = methodFromString(methodName)
    +    if (expected.size != 0 && observed.size != expected.size) {
    +      throw new IllegalArgumentException("observed and expected must be of the same size.")
    +    }
    +    val size = observed.size
    +    // Avoid calling toArray on input vectors to avoid memory blow up
    +    // (esp if size = Int.MaxValue for a SparseVector).
    --- End diff --
    
    We don't need to worry about this case. Having that many categories in a chi-square test is uncommon and violates the assumptions of the test; even 1000 categories is already very large.
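
If the category count is small, materializing both inputs as arrays is affordable. A minimal sketch of the goodness-of-fit statistic on plain arrays follows, with `expected` rescaled to the observed total as in the quoted diff. `GoodnessOfFitSketch` and `pearsonStatistic` are hypothetical names, and unlike the quoted loop this version also includes categories whose observed value is zero.

```scala
object GoodnessOfFitSketch {
  // Pearson statistic: sum over categories of (obs - exp)^2 / exp,
  // after rescaling `expected` so that it sums to the observed total.
  def pearsonStatistic(observed: Array[Double], expected: Array[Double]): Double = {
    require(observed.length == expected.length,
      "observed and expected must be of the same size.")
    require(expected.forall(_ > 0.0), "Values in expected must be positive.")
    val scale = observed.sum / expected.sum
    observed.zip(expected).map { case (obs, exp) =>
      val scaled = scale * exp
      val dev = obs - scaled
      dev * dev / scaled
    }.sum
  }
}
```

The degrees of freedom for this test are the number of categories minus one, matching `df = size - 1` in the diff.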



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16014267
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
    @@ -0,0 +1,88 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import org.apache.spark.annotation.Experimental
    +
    +/**
    + * :: Experimental ::
    + * Trait for hypothesis test results.
    + * @tparam DF Return type of `degreesOfFreedom`
    + */
    +@Experimental
    +trait TestResult[DF] {
    +
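    +  /**
    +   * Probability, under the null hypothesis, of a statistic at least as extreme as the observed one.
    +   */
    +  def pValue: Double
    +
    +  /**
    +   * Degree(s) of freedom of the hypothesis test.
    +   * @return the degrees of freedom, typed as `DF` so that each test can choose its own type.
    +   */
    +  def degreesOfFreedom: DF
    +
    +  /**
    +   * Test statistic computed from the input data.
    +   * @return the value of the test statistic.
    +   */
    +  def statistic: Double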
    +
    +  /**
    +   * String explaining the hypothesis test result.
    +   * Specific classes implementing this trait should override this method to output test-specific
    +   * information.
    +   */
    +  override def toString: String = {
    +
    +    // String explaining what the p-value indicates.
    +    val pValueExplain = if (pValue <= 0.01) {
    +      "Very strong presumption against null hypothesis."
    +    } else if (0.01 < pValue && pValue <= 0.05) {
    +      "Strong presumption against null hypothesis."
    +    } else if (0.05 < pValue && pValue <= 0.1) {
    +      "Low presumption against null hypothesis."
    +    } else {
    +      "No presumption against null hypothesis."
    +    }
    +
    +    s"degrees of freedom = ${degreesOfFreedom.toString} \n" +
    +    s"statistic = $statistic \n" +
    +    s"pValue = $pValue \n" + pValueExplain
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Object containing the test results for the chi squared hypothesis test.
    + */
    +@Experimental
    +case class ChiSquaredTestResult(override val pValue: Double,
    --- End diff --
    
    No case class features are used, especially pattern matching. This case class will extend `Product5` and make it impossible to add a field, for example, whether correction is used or not. Also, with a case class, it is very hard to add a static method. We might want to write the test result to JSON and later parse it back. A natural choice would be `ChiSquaredTestResult.fromJSON(json: String)` but it is very complicated to match the type signature generated by Scala's compiler. We had this problem with `LabeledPoint` in MLlib, which is a public case class.
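
    A minimal sketch of the alternative described above, assuming a plain class with a companion object. Only `pValue` appears in the quoted diff; the other four fields are assumptions chosen to match `Product5`. The `serialize`/`fromString` pair and the comma-separated format are hypothetical stand-ins for the JSON round trip, used here only to avoid depending on a JSON library.

    ```scala
    // Hypothetical sketch, not the PR's code: a plain class instead of a case class.
    // Field names other than pValue are assumptions made for illustration.
    class ChiSquaredTestResult(
        val pValue: Double,
        val degreesOfFreedom: Int,
        val statistic: Double,
        val method: String,
        val nullHypothesis: String) {

      // Stand-in for a JSON writer: a simple comma-separated encoding.
      def serialize: String =
        s"$pValue,$degreesOfFreedom,$statistic,$method,$nullHypothesis"
    }

    // With a plain class, the companion object is entirely ours to define,
    // so a static-style factory can be added without matching generated signatures.
    object ChiSquaredTestResult {
      def fromString(s: String): ChiSquaredTestResult = {
        // Limit of 5 keeps any commas inside the last field (the null hypothesis text).
        val parts = s.split(",", 5)
        require(parts.length == 5, s"Expected 5 fields but got ${parts.length}")
        new ChiSquaredTestResult(
          parts(0).toDouble, parts(1).toInt, parts(2).toDouble, parts(3), parts(4))
      }
    }
    ```

    Adding a sixth field later (for example, whether a correction was applied) would then only require a new constructor parameter with a default value, since no generated `ProductN` parent or `unapply` signature depends on the field count.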



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by dorx <gi...@git.apache.org>.

Github user dorx commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1733#discussion_r16024688
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.test
    +
    +import breeze.linalg.{DenseMatrix => BDM}
    +import cern.jet.stat.Probability.chiSquareComplemented
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Conduct the chi-squared test for the input RDDs using the specified method.
    + * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
    + * on an input of type `Matrix` in which independence between columns is assessed.
    + * We also provide a method for computing the chi-squared statistic between each feature and the
    + * label for an input `RDD[LabeledPoint]`, returning an `Array[ChiSquaredTestResult]` of size =
    + * number of features in the input RDD.
    + *
    + * Supported methods for goodness of fit: `pearson` (default)
    + * Supported methods for independence: `pearson` (default)
    + *
    + * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
    + */
    +private[stat] object ChiSqTest extends Logging {
    +
    +  /**
    +   * @param name String name for the method.
    +   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
    +   */
    +  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
    +
    +  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
    +  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
    +    val dev = observed - expected
    +    dev * dev / expected
    +  })
    +
    +  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
    +  object NullHypothesis extends Enumeration {
    +    type NullHypothesis = Value
    +    val goodnessOfFit = Value("observed follows the same distribution as expected.")
    +    val independence = Value("observations in each column are statistically independent.")
    +  }
    +
    +  // Method identification based on input methodName string
    +  private def methodFromString(methodName: String): Method = {
    +    methodName match {
    +      case PEARSON.name => PEARSON
    +      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
    +    }
    +  }
    +
    +  /**
    +   * Conduct Pearson's independence test for each feature against the label across the input RDD.
    +   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
    +   * the independence test.
    +   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
    +   */
    +  def chiSquaredFeatures(data: RDD[LabeledPoint],
    +      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
    +    val numCols = data.first().features.size
    +    val results = new Array[ChiSqTestResult](numCols)
    +    var labels = Array[Double]()
    +    // At most 100 columns at a time
    +    val batchSize = 100
    +    var batch = 0
    +    while (batch * batchSize < numCols) {
    +      // The following block of code can be cleaned up and made public as
    +      // chiSquared(data: RDD[(V1, V2)])
    +      val startCol = batch * batchSize
    +      val endCol = startCol + math.min(batchSize, numCols - startCol)
    +      val pairCounts = data.flatMap { p =>
    +        // assume dense vectors
    +        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
    +          (startCol + col, feature, p.label)
    +        }
    +      }.countByValue()
    +
    +      if (labels.size == 0) {
    +        // Do this only once for the first column since labels are invariant across features.
    +        labels = pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct
    +      }
    +      val numLabels = labels.size
    +      pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
    +        val features = keys.map(_._2).toArray.distinct
    +        val numRows = features.size
    +        val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
    +        keys.foreach { case (_, feature, label) =>
    +          val i = features.indexOf(feature)
    +          val j = labels.indexOf(label)
    +          contingency(i, j) += pairCounts((col, feature, label))
    +        }
    +        results(col) = chiSquaredMatrix(Matrices.fromBreeze(contingency), methodName)
    +      }
    +      batch += 1
    +    }
    +    results
    +  }
    +
    +  /**
    +   * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
    +   * Uniform distribution is assumed when `expected` is not passed in.
    +   */
    +  def chiSquared(observed: Vector,
    +      expected: Vector = Vectors.dense(Array[Double]()),
    +      methodName: String = PEARSON.name): ChiSqTestResult = {
    +
    +    // Validate input arguments
    +    val method = methodFromString(methodName)
    +    if (expected.size != 0 && observed.size != expected.size) {
    +      throw new IllegalArgumentException("observed and expected must be of the same size.")
    +    }
    +    val size = observed.size
    +    // Avoid calling toArray on input vectors to avoid memory blow up
    +    // (esp if size = Int.MaxValue for a SparseVector).
    +    // Check positivity and collect sums
    +    var obsSum = 0.0
    +    var expSum = if (expected.size == 0) 1.0 else 0.0
    +    var i = 0
    +    while (i < size) {
    +      val obs = observed(i)
    +      if (obs < 0.0) {
    +        throw new IllegalArgumentException("Values in observed must be nonnegative.")
    +      }
    +      obsSum += obs
    +      if (expected.size > 0) {
    +        val exp = expected(i)
    +        if (exp <= 0.0) {
    --- End diff --
    
    What do we do if both observed and expected are 0?  R gives
    ```
    chisq.test(c(0, 0, 3), p = c(0, 0.6, 0.4))
    
    	Chi-squared test for given probabilities
    
    data:  c(0, 0, 3)
    X-squared = NaN, df = 2, p-value = NA
    ```
    even though the statistic is technically undefined in both cases. FWIW commons-math3 throws an exception for 0 values in `expected`.
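
    For reference, a small sketch of how the Pearson term from the diff behaves on the R input above under plain IEEE 754 double arithmetic. `pearsonTerm` is a hypothetical helper that mirrors the `PEARSON` function in the diff; it is not part of the PR.

    ```scala
    // Hypothetical helper mirroring the PEARSON chiSqFunc in the diff.
    def pearsonTerm(observed: Double, expected: Double): Double = {
      val dev = observed - expected
      dev * dev / expected
    }

    // Same input as the R call: observed counts and given probabilities.
    val observed = Array(0.0, 0.0, 3.0)
    val probs = Array(0.0, 0.6, 0.4)

    // Scale the probabilities by the total count to get expected counts.
    val total = observed.sum
    val expected = probs.map(_ * total)

    // The first term is 0.0 / 0.0, which is NaN, and NaN propagates through the sum.
    val statistic = observed.zip(expected).map { case (o, e) => pearsonTerm(o, e) }.sum
    ```

    Without an explicit check, the statistic comes out as NaN when both values are 0, which matches what R reports. A positive observed count with an expected count of 0 gives positive infinity instead, so the two situations are distinguishable only if `expected` is validated up front.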



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51545655
  
    QA results for PR 1733:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18150/consoleFull



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1733#issuecomment-51289343
  
    QA results for PR 1733:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17974/consoleFull

