You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by erikerlandson <gi...@git.apache.org> on 2016/06/01 15:38:30 UTC

[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

GitHub user erikerlandson opened a pull request:

    https://github.com/apache/spark/pull/13440

    [SPARK-15699] [ML] Implement a Chi-Squared test statistic option for measuring split quality

    ## What changes were proposed in this pull request?
    Using test statistics as a measure of decision tree split quality is a useful split halting measure that can yield improved model quality. I am proposing to add the chi-squared test statistic as a new impurity option (in addition to "gini" and "entropy") for classification decision trees and ensembles.
    
    https://issues.apache.org/jira/browse/SPARK-15699
    
    http://erikerlandson.github.io/blog/2016/05/26/measuring-decision-tree-split-quality-with-test-statistic-p-values/
    
    ## How was this patch tested?
    I added unit testing to verify that the chi-squared "impurity" measure functions as expected when used for decision tree training.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/erikerlandson/spark chisquared_split_quality

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13440.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13440
    
----
commit 04c131647f9cbc8208b755c6d08f16a1056f171c
Author: Erik Erlandson <ee...@redhat.com>
Date:   2016-05-31T21:44:22Z

    Implement a Chi-Squared test statistic option for measuring split quality when training decision trees

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59751/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by thunterdb <gi...@git.apache.org>.
Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r104812468
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
    @@ -50,6 +50,50 @@ trait Impurity extends Serializable {
       @Since("1.0.0")
       @DeveloperApi
       def calculate(count: Double, sum: Double, sumSquares: Double): Double
    +
    +  /**
    +   * :: DeveloperApi ::
    +   * Compute a test-statistic p-value quality measure from left and right split populations
    +   * @param calcL impurity calculator for the left split population
    +   * @param calcR impurity calculator for the right split population
    +   * @return The p-value for the null hypothesis; that left and right split populations
    +   * represent the same distribution
    +   * @note Unless overridden this method will fail with an exception, for backward compatability
    +   */
    +  @Since("2.0.0")
    +  @DeveloperApi
    +  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double =
    --- End diff --
    
    scala: do not add a default implementation, it causes issues with java compatibility


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    @thunterdb patch is clean again.  @HyukjinKwon found my problem.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    @HyukjinKwon @thunterdb can you all take a look at this?  It's been under review for quite a long time!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    @srowen I discuss some of these questions in the [blog post](http://erikerlandson.github.io/blog/2016/05/26/measuring-decision-tree-split-quality-with-test-statistic-p-values/), but the tl/dr is that split quality measures based on statistical tests having p-values are in some senses "less arbitrary."  Specifying a p-value as a split quality halting condition has essentially the same semantic regardless of the test.  Most such tests also intrinsically take into account decreasing population sizes.  As the the splitting progresses and population sizes decrease, it inherently takes a larger and larger population difference to meet the p-value threshold.
    
    On the more pragmatic side, in that post I also demonstrate chi-squared split quality generating a more parsimonious tree than other metrics, which does a better job at ignoring poor quality features.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r218246415
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala ---
    @@ -71,6 +71,23 @@ object Gini extends Impurity {
       @Since("1.1.0")
       def instance: this.type = this
     
    +  /**
    +   * :: DeveloperApi ::
    +   * p-values for test-statistic measures, unsupported for [[Gini]]
    +   */
    +  @Since("2.2.0")
    +  @DeveloperApi
    +  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double =
    --- End diff --
    
    I'll consider if there's a unifying idea here. pval-based metrics require integrating information across the new split children, which I believe was not the case for existing methods.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #64309 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64309/consoleFull)** for PR 13440 at commit [`6d38cfd`](https://github.com/apache/spark/commit/6d38cfd8d68be9cf2fba5901a8a2bb205c0fbd01).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #75827 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75827/testReport)** for PR 13440 at commit [`5cbb21c`](https://github.com/apache/spark/commit/5cbb21c7a4c1443026d58d8168322ebe2d5c6160).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75944/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75868/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #75944 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75944/testReport)** for PR 13440 at commit [`d206c6b`](https://github.com/apache/spark/commit/d206c6bdb763d2e3e398c8fac21b03f2eb2d445e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by thunterdb <gi...@git.apache.org>.
Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r104813302
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala ---
    @@ -237,6 +237,41 @@ class DecisionTreeClassifierSuite
         compareAPIs(rdd, dt, categoricalFeatures = Map.empty[Int, Int], numClasses)
       }
     
    +  test("split quality using chi-squared and minimum gain") {
    +    // Generate a data set where the 1st feature is useful and the others are noise
    +    val features = Vector.fill(200) {
    +      Array.fill(3) { scala.util.Random.nextInt(2).toDouble }
    +    }
    +    val labels = features.map { fv =>
    +      LabeledPoint(if (fv(0) == 1.0) 1.0 else 0.0, Vectors.dense(fv))
    +    }
    +    val rdd = sc.parallelize(labels)
    +
    +    // two-class learning problem
    +    val numClasses = 2
    +    // all binary features
    +    val catFeatures = Map(Vector.tabulate(features.head.length) { j => (j, 2) } : _*)
    +
    +    // Chi-squared split quality with a p-value threshold of 0.01 should allow
    +    // only the first feature to be used since the others are uncorrelated noise
    +    val train: DataFrame = TreeTests.setMetadata(rdd, catFeatures, numClasses)
    +    val dt = new DecisionTreeClassifier()
    +      .setImpurity("chisquared")
    +      .setMaxDepth(5)
    +      .setMinInfoGain(0.01)
    +    val treeModel = dt.fit(train)
    +
    +    // The tree should use exactly one of the 3 features: featue(0)
    --- End diff --
    
    nit: feature


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #64309 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64309/consoleFull)** for PR 13440 at commit [`6d38cfd`](https://github.com/apache/spark/commit/6d38cfd8d68be9cf2fba5901a8a2bb205c0fbd01).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75827/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r218997600
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala ---
    @@ -71,6 +71,23 @@ object Gini extends Impurity {
       @Since("1.1.0")
       def instance: this.type = this
     
    +  /**
    +   * :: DeveloperApi ::
    +   * p-values for test-statistic measures, unsupported for [[Gini]]
    +   */
    +  @Since("2.2.0")
    +  @DeveloperApi
    +  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double =
    --- End diff --
    
    @srowen @willb 
    I cached the design of the metrics back in. In general, `Impurity` already uses methods that are only defined on certain impurity sub-classes, and so this new method does not change that situation.
    
    My take on the "problem" is that the existing measures are all based on a localized concept of "purity" (or impurity) that can be calculated using _only_ the data at a single node. Splitting based on statistical tests (p-values) breaks that model, since it is making use of a more generalized concept of split quality that requires the sample populations of both children from a candidate split. A maximally general signature would probably involve the parent and both children.
    
    Another kink in the current design is that `ImpurityCalculator` is essentially parallel to `Impurity`, and in fact `ImpurityCalculator#calculate()` is how impurity measures are currently requested. `Impurity` seems somewhat redundant, and might be factored out in favor of `ImpurityCalculator`. The current signature `calculate()` might be generalized into a more inclusive concept of split quality that expects to make use of {parent,left,right}.
    
    Calls to `calculate()` are not very wide-spread but threading that change through is outside the scope of this particular PR. If people are interested in that kind of refactoring I could look into it in the near future but probably not in the next couple weeks.
    
    That kind of change would also be API breaking and so a good target for 3.0


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by jsigee87 <gi...@git.apache.org>.
Github user jsigee87 commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Is this still being considered?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #80093 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80093/testReport)** for PR 13440 at commit [`bb2f660`](https://github.com/apache/spark/commit/bb2f6607ef3d7921e9e2320da2d59dd068863a92).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #75923 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75923/testReport)** for PR 13440 at commit [`4e367b1`](https://github.com/apache/spark/commit/4e367b1eb0fcbccecfebf912a6e6cd56ebf99c37).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66756/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    rebased to latest head of master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80093/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    @thunterdb this is ready for any further review


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75923/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by wangmiao1981 <gi...@git.apache.org>.
Github user wangmiao1981 commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    @thunterdb Can you take a look? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #66756 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66756/consoleFull)** for PR 13440 at commit [`83f5e83`](https://github.com/apache/spark/commit/83f5e83fb87407bdd7dc8d740fba6fb30d1da3aa).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73006/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64309/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    This is a re-submission of #13438 to fix target branch


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #59740 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59740/consoleFull)** for PR 13440 at commit [`04c1316`](https://github.com/apache/spark/commit/04c131647f9cbc8208b755c6d08f16a1056f171c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r104822458
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
    @@ -50,6 +50,50 @@ trait Impurity extends Serializable {
       @Since("1.0.0")
       @DeveloperApi
       def calculate(count: Double, sum: Double, sumSquares: Double): Double
    +
    +  /**
    +   * :: DeveloperApi ::
    +   * Compute a test-statistic p-value quality measure from left and right split populations
    +   * @param calcL impurity calculator for the left split population
    +   * @param calcR impurity calculator for the right split population
    +   * @return The p-value for the null hypothesis; that left and right split populations
    +   * represent the same distribution
    +   * @note Unless overridden this method will fail with an exception, for backward compatability
    +   */
    +  @Since("2.0.0")
    +  @DeveloperApi
    +  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double =
    --- End diff --
    
    IIRC, there were other impurity measures using this 'unsupported' idiom.   However, I'm fine with making it abstract, if that's the direction you are thinking.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #66756 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66756/consoleFull)** for PR 13440 at commit [`83f5e83`](https://github.com/apache/spark/commit/83f5e83fb87407bdd7dc8d740fba6fb30d1da3aa).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95018/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Ready for review


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by shaneknapp <gi...@git.apache.org>.
Github user shaneknapp commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    i stopped the build as i need to restart jenkins...  i'll retrigger this when we're back up and running.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66679/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #59751 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59751/consoleFull)** for PR 13440 at commit [`6d38cfd`](https://github.com/apache/spark/commit/6d38cfd8d68be9cf2fba5901a8a2bb205c0fbd01).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76041/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #73006 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73006/testReport)** for PR 13440 at commit [`61cbf7c`](https://github.com/apache/spark/commit/61cbf7c5aafe4cc2c59daf0b058e701a8545f4eb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #95018 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95018/testReport)** for PR 13440 at commit [`bb2f660`](https://github.com/apache/spark/commit/bb2f6607ef3d7921e9e2320da2d59dd068863a92).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    @srowen @thunterdb any more thoughts on this?
    how about @sethah @yanboliang @jkbradley?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r104821445
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala ---
    @@ -0,0 +1,162 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.impurity
    +
    +import org.apache.spark.annotation.{DeveloperApi, Experimental, Since}
    +
    +/**
    + * :: Experimental ::
    + * Class for calculating [[https://en.wikipedia.org/wiki/Chi-squared_test chi-squared]]
    + * during binary classification.
    + */
    +@Since("2.0.0")
    +@Experimental
    +object ChiSquared extends Impurity {
    +  private object CSTest extends org.apache.commons.math3.stat.inference.ChiSquareTest()
    --- End diff --
    
    IIRC it was to "allocate on the stack"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #75923 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75923/testReport)** for PR 13440 at commit [`4e367b1`](https://github.com/apache/spark/commit/4e367b1eb0fcbccecfebf912a6e6cd56ebf99c37).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #75836 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75836/testReport)** for PR 13440 at commit [`6fea64d`](https://github.com/apache/spark/commit/6fea64ddee6b89f0a823f218a3d0119ced7fcc2f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #94042 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94042/testReport)** for PR 13440 at commit [`bb2f660`](https://github.com/apache/spark/commit/bb2f6607ef3d7921e9e2320da2d59dd068863a92).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    update - I'm consulting with some teammates about what it might mean to also support Bayesian variations on split quality, since there has been a lot of interest in the last few years regarding Bayesian alternatives to more traditional p-value based statistics.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94042/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    I don't have ML knowledge enough to review this. I can cc ML committer guys who I can guess have some expertise from git blame but I hope there are some sign-offs left here from some guys here ahead.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #75782 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75782/testReport)** for PR 13440 at commit [`6762a18`](https://github.com/apache/spark/commit/6762a18dd558b61d5b292787115d6e8c8768ed12).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by thunterdb <gi...@git.apache.org>.
Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r104812803
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
    @@ -50,6 +50,50 @@ trait Impurity extends Serializable {
       @Since("1.0.0")
       @DeveloperApi
       def calculate(count: Double, sum: Double, sumSquares: Double): Double
    +
    +  /**
    +   * :: DeveloperApi ::
    +   * Compute a test-statistic p-value quality measure from left and right split populations
    +   * @param calcL impurity calculator for the left split population
    +   * @param calcR impurity calculator for the right split population
    +   * @return The p-value for the null hypothesis; that left and right split populations
    +   * represent the same distribution
    +   * @note Unless overridden this method will fail with an exception, for backward compatability
    +   */
    +  @Since("2.0.0")
    +  @DeveloperApi
    +  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double =
    --- End diff --
    
    you cannot guarantee compatibility with existing code here, since you would break the bytecode either way.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #76041 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76041/testReport)** for PR 13440 at commit [`d2a2381`](https://github.com/apache/spark/commit/d2a2381edc7d2b2c395f0e0049e152190cd15827).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #59745 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59745/consoleFull)** for PR 13440 at commit [`1136518`](https://github.com/apache/spark/commit/1136518f81be64951044d5f9f28f15b12b2bc0a6).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r218252461
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
    @@ -52,6 +52,49 @@ trait Impurity extends Serializable {
       @Since("1.0.0")
       @DeveloperApi
       def calculate(count: Double, sum: Double, sumSquares: Double): Double
    +
    +  /**
    +   * :: DeveloperApi ::
    +   * Compute a test-statistic p-value quality measure from left and right split populations
    +   * @param calcL impurity calculator for the left split population
    +   * @param calcR impurity calculator for the right split population
    +   * @return The p-value for the null hypothesis; that left and right split populations
    +   * represent the same distribution
    +   * @note Unless overridden this method will fail with an exception, for backward compatability
    +   */
    +  @Since("2.2.0")
    +  @DeveloperApi
    +  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double
    +
    +  /**
    +   * :: DeveloperApi ::
    +   * Determine if this impurity measure is a test-statistic measure
    +   * @return True if this is a split quality measure based on a test statistic (i.e. returns a
    +   * p-value) or false otherwise.
    +   * @note Unless overridden this method returns false by default, for backward compatability
    +   */
    +  @Since("2.2.0")
    +  @DeveloperApi
    +  def isTestStatistic: Boolean
    --- End diff --
    
    Can this be customized or extended externally to spark?  I'm wondering why it is public.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75782/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75812/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #75944 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75944/testReport)** for PR 13440 at commit [`d206c6b`](https://github.com/apache/spark/commit/d206c6bdb763d2e3e398c8fac21b03f2eb2d445e).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #75827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75827/testReport)** for PR 13440 at commit [`5cbb21c`](https://github.com/apache/spark/commit/5cbb21c7a4c1443026d58d8168322ebe2d5c6160).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #59751 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59751/consoleFull)** for PR 13440 at commit [`6d38cfd`](https://github.com/apache/spark/commit/6d38cfd8d68be9cf2fba5901a8a2bb205c0fbd01).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #75812 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75812/testReport)** for PR 13440 at commit [`6762a18`](https://github.com/apache/spark/commit/6762a18dd558b61d5b292787115d6e8c8768ed12).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #66679 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66679/consoleFull)** for PR 13440 at commit [`b199ae3`](https://github.com/apache/spark/commit/b199ae3b39cda2147fb44f0e788c3be36c8dcf6a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #59745 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59745/consoleFull)** for PR 13440 at commit [`1136518`](https://github.com/apache/spark/commit/1136518f81be64951044d5f9f28f15b12b2bc0a6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    @thunterdb not sure what is failing in the CI.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    @holdenk @jkbradley looks like it's clean again


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    @holdenk yes, I'll rebase it this week.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    @srowen @thunterdb this PR passes all tests and merges cleanly -- can you take another look?  It's been open for quite a while now.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #75868 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75868/testReport)** for PR 13440 at commit [`a75a01b`](https://github.com/apache/spark/commit/a75a01b34cc0288b77309ed83a20569b86cb69db).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    I've not seen chi squared used as a split statistic; when is it theoretically better than entropy? Or something a bit more fundamental like KL divergence? It makes some sense but does require some assumption about the data


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66766/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r218208383
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala ---
    @@ -0,0 +1,162 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.impurity
    +
    +import org.apache.spark.annotation.{DeveloperApi, Experimental, Since}
    +
    +/**
    + * :: Experimental ::
    + * Class for calculating Chi Squared as a split quality metric during binary classification.
    + */
    +@Since("2.2.0")
    +@Experimental
    +object ChiSquared extends Impurity {
    +  private object CSTest extends org.apache.commons.math3.stat.inference.ChiSquareTest()
    +
    +  /**
    +   * Get this impurity instance.
    +   * This is useful for passing impurity parameters to a Strategy in Java.
    +   */
    +  @Since("1.1.0")
    --- End diff --
    
    I think I'd label all these as Since 2.5.0 even if they override a method that existed earlier. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #95018 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95018/testReport)** for PR 13440 at commit [`bb2f660`](https://github.com/apache/spark/commit/bb2f6607ef3d7921e9e2320da2d59dd068863a92).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Hi @wangmiao1981,
    
    I am still interested in this, but I don't have any sense about whether upstream has any interest.  Does upstream have any intention to accept it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by thunterdb <gi...@git.apache.org>.
Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r104813864
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala ---
    @@ -0,0 +1,162 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.impurity
    +
    +import org.apache.spark.annotation.{DeveloperApi, Experimental, Since}
    +
    +/**
    + * :: Experimental ::
    + * Class for calculating [[https://en.wikipedia.org/wiki/Chi-squared_test chi-squared]]
    + * during binary classification.
    + */
    +@Since("2.0.0")
    +@Experimental
    +object ChiSquared extends Impurity {
    +  private object CSTest extends org.apache.commons.math3.stat.inference.ChiSquareTest()
    --- End diff --
    
    why not a private `val`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r218252156
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala ---
    @@ -71,6 +71,23 @@ object Gini extends Impurity {
       @Since("1.1.0")
       def instance: this.type = this
     
    +  /**
    +   * :: DeveloperApi ::
    +   * p-values for test-statistic measures, unsupported for [[Gini]]
    +   */
    +  @Since("2.2.0")
    +  @DeveloperApi
    +  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double =
    --- End diff --
    
    I suspect that the generalization is closer to my newer signature 
    `val pval = imp.calculate(leftImpurityCalculator, rightImpurityCalculator)`
    where you have all the context from the left and right nodes. The existing gain-based calculation should fit into this framework, just doing its current weighted average of purity gain.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r218245670
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -670,14 +670,32 @@ private[spark] object RandomForest extends Logging {
         val leftImpurity = leftImpurityCalculator.calculate() // Note: This equals 0 if count = 0
         val rightImpurity = rightImpurityCalculator.calculate()
     
    -    val leftWeight = leftCount / totalCount.toDouble
    -    val rightWeight = rightCount / totalCount.toDouble
    +    val gain = metadata.impurity match {
    +      case imp if (imp.isTestStatistic) =>
    +        // For split quality measures based on a test-statistic, run the test on the
    +        // left and right sub-populations to get a p-value for the null hypothesis
    +        val pval = imp.calculate(leftImpurityCalculator, rightImpurityCalculator)
    +        // Transform the test statistic p-val into a larger-is-better gain value
    +        Impurity.pValToGain(pval)
    +
    +      case _ =>
    +        // Default purity-gain logic:
    +        // measure the weighted decrease in impurity from parent to the left and right
    +        val leftWeight = leftCount / totalCount.toDouble
    +        val rightWeight = rightCount / totalCount.toDouble
    +
    +        impurity - leftWeight * leftImpurity - rightWeight * rightImpurity
    +    }
     
    -    val gain = impurity - leftWeight * leftImpurity - rightWeight * rightImpurity
    +    // If the impurity being used is a test statistic p-val, apply a standard transform into
    +    // a larger-is-better gain value for the minimum-gain threshold
    +    val minGain =
    +      if (metadata.impurity.isTestStatistic) Impurity.pValToGain(metadata.minInfoGain)
    +      else metadata.minInfoGain
    --- End diff --
    
    The main issue I recall was that all of the existing metrics assume some kind of "larger is better" gain, and p-values are "smaller is better."  I'll take another pass over it and see if I can push that distinction down so it doesn't require exposing new methods.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #94042 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94042/testReport)** for PR 13440 at commit [`bb2f660`](https://github.com/apache/spark/commit/bb2f6607ef3d7921e9e2320da2d59dd068863a92).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r218208123
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -670,14 +670,32 @@ private[spark] object RandomForest extends Logging {
         val leftImpurity = leftImpurityCalculator.calculate() // Note: This equals 0 if count = 0
         val rightImpurity = rightImpurityCalculator.calculate()
     
    -    val leftWeight = leftCount / totalCount.toDouble
    -    val rightWeight = rightCount / totalCount.toDouble
    +    val gain = metadata.impurity match {
    +      case imp if (imp.isTestStatistic) =>
    +        // For split quality measures based on a test-statistic, run the test on the
    +        // left and right sub-populations to get a p-value for the null hypothesis
    +        val pval = imp.calculate(leftImpurityCalculator, rightImpurityCalculator)
    +        // Transform the test statistic p-val into a larger-is-better gain value
    +        Impurity.pValToGain(pval)
    +
    +      case _ =>
    +        // Default purity-gain logic:
    +        // measure the weighted decrease in impurity from parent to the left and right
    +        val leftWeight = leftCount / totalCount.toDouble
    +        val rightWeight = rightCount / totalCount.toDouble
    +
    +        impurity - leftWeight * leftImpurity - rightWeight * rightImpurity
    +    }
     
    -    val gain = impurity - leftWeight * leftImpurity - rightWeight * rightImpurity
    +    // If the impurity being used is a test statistic p-val, apply a standard transform into
    +    // a larger-is-better gain value for the minimum-gain threshold
    +    val minGain =
    +      if (metadata.impurity.isTestStatistic) Impurity.pValToGain(metadata.minInfoGain)
    +      else metadata.minInfoGain
    --- End diff --
    
    Kind of a design question here... right now the caller has to switch logic based on what's inside metadata. Can methods like metadata.minInfoGain just implement different logic when the impurity is a test statistic, and so on? push this down towards the impurity implementation? I wonder if `isTestStatistic` can go away with the right API, but I am not familiar with the details of what that requires.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by wangmiao1981 <gi...@git.apache.org>.
Github user wangmiao1981 commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    @erikerlandson Are you still working on this PR? Thanks! Miao


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #75812 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75812/testReport)** for PR 13440 at commit [`6762a18`](https://github.com/apache/spark/commit/6762a18dd558b61d5b292787115d6e8c8768ed12).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by thunterdb <gi...@git.apache.org>.
Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r104812596
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
    @@ -50,6 +50,50 @@ trait Impurity extends Serializable {
       @Since("1.0.0")
       @DeveloperApi
       def calculate(count: Double, sum: Double, sumSquares: Double): Double
    +
    +  /**
    +   * :: DeveloperApi ::
    +   * Compute a test-statistic p-value quality measure from left and right split populations
    +   * @param calcL impurity calculator for the left split population
    +   * @param calcR impurity calculator for the right split population
    +   * @return The p-value for the null hypothesis; that left and right split populations
    +   * represent the same distribution
    +   * @note Unless overridden this method will fail with an exception, for backward compatability
    +   */
    +  @Since("2.0.0")
    +  @DeveloperApi
    +  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double =
    +    throw new UnsupportedOperationException("Impurity.calculate")
    +
    +  /**
    +   * :: DeveloperApi ::
    +   * Determine if this impurity measure is a test-statistic measure
    +   * @return True if this is a split quality measure based on a test statistic (i.e. returns a
    +   * p-value) or false otherwise.
    +   * @note Unless overridden this method returns false by default, for backward compatability
    +   */
    +  @Since("2.0.0")
    +  @DeveloperApi
    +  def isTestStatistic: Boolean = false
    +}
    +
    +/**
    + * :: DeveloperApi ::
    + * Utility functions for Impurity measures
    + */
    +@Since("2.0.0")
    +@DeveloperApi
    --- End diff --
    
    there is no need for this object to be publicly exposed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59745/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r104822183
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
    @@ -50,6 +50,50 @@ trait Impurity extends Serializable {
       @Since("1.0.0")
       @DeveloperApi
       def calculate(count: Double, sum: Double, sumSquares: Double): Double
    +
    +  /**
    +   * :: DeveloperApi ::
    +   * Compute a test-statistic p-value quality measure from left and right split populations
    +   * @param calcL impurity calculator for the left split population
    +   * @param calcR impurity calculator for the right split population
    +   * @return The p-value for the null hypothesis; that left and right split populations
    +   * represent the same distribution
    +   * @note Unless overridden this method will fail with an exception, for backward compatability
    +   */
    +  @Since("2.0.0")
    +  @DeveloperApi
    +  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double =
    +    throw new UnsupportedOperationException("Impurity.calculate")
    +
    +  /**
    +   * :: DeveloperApi ::
    +   * Determine if this impurity measure is a test-statistic measure
    +   * @return True if this is a split quality measure based on a test statistic (i.e. returns a
    +   * p-value) or false otherwise.
    +   * @note Unless overridden this method returns false by default, for backward compatability
    +   */
    +  @Since("2.0.0")
    +  @DeveloperApi
    +  def isTestStatistic: Boolean = false
    +}
    +
    +/**
    + * :: DeveloperApi ::
    + * Utility functions for Impurity measures
    + */
    +@Since("2.0.0")
    +@DeveloperApi
    --- End diff --
    
    I don't think so.  I don't recall any specific motivation to keep it private, but historically Spark seems to default things to "minimum visibility."   The only method currently defined here is an implementation detail for hacking p-values into the existing 'gain' system, where larger is assumed to be better.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #80093 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80093/testReport)** for PR 13440 at commit [`bb2f660`](https://github.com/apache/spark/commit/bb2f6607ef3d7921e9e2320da2d59dd068863a92).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #66679 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66679/consoleFull)** for PR 13440 at commit [`b199ae3`](https://github.com/apache/spark/commit/b199ae3b39cda2147fb44f0e788c3be36c8dcf6a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    @thunterdb can you take a look at this now that 2.2 is out?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #66766 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66766/consoleFull)** for PR 13440 at commit [`83f5e83`](https://github.com/apache/spark/commit/83f5e83fb87407bdd7dc8d740fba6fb30d1da3aa).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by wangmiao1981 <gi...@git.apache.org>.
Github user wangmiao1981 commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    @erikerlandson I am just helping clearing the stale PRs. :) I have no idea whether they have intention to accept it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r218209453
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
    @@ -52,6 +52,49 @@ trait Impurity extends Serializable {
       @Since("1.0.0")
       @DeveloperApi
       def calculate(count: Double, sum: Double, sumSquares: Double): Double
    +
    +  /**
    +   * :: DeveloperApi ::
    +   * Compute a test-statistic p-value quality measure from left and right split populations
    +   * @param calcL impurity calculator for the left split population
    +   * @param calcR impurity calculator for the right split population
    +   * @return The p-value for the null hypothesis; that left and right split populations
    +   * represent the same distribution
    +   * @note Unless overridden this method will fail with an exception, for backward compatability
    +   */
    +  @Since("2.2.0")
    +  @DeveloperApi
    +  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double
    +
    +  /**
    +   * :: DeveloperApi ::
    +   * Determine if this impurity measure is a test-statistic measure
    +   * @return True if this is a split quality measure based on a test statistic (i.e. returns a
    +   * p-value) or false otherwise.
    +   * @note Unless overridden this method returns false by default, for backward compatability
    +   */
    +  @Since("2.2.0")
    +  @DeveloperApi
    +  def isTestStatistic: Boolean
    --- End diff --
    
    Adding methods to a public trait is technically an API breaking change. This might be considered a Developer API even though it's not labeled that way. Still if we can avoid adding to the API here, it'd be better.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #75836 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75836/testReport)** for PR 13440 at commit [`6fea64d`](https://github.com/apache/spark/commit/6fea64ddee6b89f0a823f218a3d0119ced7fcc2f).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r104822632
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala ---
    @@ -237,6 +237,41 @@ class DecisionTreeClassifierSuite
         compareAPIs(rdd, dt, categoricalFeatures = Map.empty[Int, Int], numClasses)
       }
     
    +  test("split quality using chi-squared and minimum gain") {
    +    // Generate a data set where the 1st feature is useful and the others are noise
    +    val features = Vector.fill(200) {
    +      Array.fill(3) { scala.util.Random.nextInt(2).toDouble }
    +    }
    +    val labels = features.map { fv =>
    +      LabeledPoint(if (fv(0) == 1.0) 1.0 else 0.0, Vectors.dense(fv))
    +    }
    +    val rdd = sc.parallelize(labels)
    +
    +    // two-class learning problem
    +    val numClasses = 2
    +    // all binary features
    +    val catFeatures = Map(Vector.tabulate(features.head.length) { j => (j, 2) } : _*)
    +
    +    // Chi-squared split quality with a p-value threshold of 0.01 should allow
    +    // only the first feature to be used since the others are uncorrelated noise
    +    val train: DataFrame = TreeTests.setMetadata(rdd, catFeatures, numClasses)
    +    val dt = new DecisionTreeClassifier()
    +      .setImpurity("chisquared")
    +      .setMaxDepth(5)
    +      .setMinInfoGain(0.01)
    +    val treeModel = dt.fit(train)
    +
    +    // The tree should use exactly one of the 3 features: featue(0)
    --- End diff --
    
    \U0001f44d 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by shaneknapp <gi...@git.apache.org>.
Github user shaneknapp commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #59740 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59740/consoleFull)** for PR 13440 at commit [`04c1316`](https://github.com/apache/spark/commit/04c131647f9cbc8208b755c6d08f16a1056f171c).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    I think targeting 3.0 with a refactor makes the most sense.  There's no way to do this without making small breaking changes, but slightly larger changes could clean up the design.  `ImpurityCalculator` can subsume `Impurity`, and a more general rethinking of gain and impurity can be accommodated too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Yeah I take your point that the trait Impurity already defines two methods, only one of which is implemented for each of the subclasses. It's already a funky design that probably should have been generalized differently. I think a rewrite for Spark 3 would be worthwhile, personally. I'm also not quite sure of the difference between the Impurity and ImpurityCalculator class; it seems like Impurity should fold into ImpurityCalculator. 
    
    Is the single method we really want to define something like `computeInformationGain(ImpurityCalculator, ImpurityCalculator)`? even the new method you've added is not directly computing info gain, nor were the existing ones in Impurity. But that's the thing we need and abstraction for over several implementations, it seems.
    
    Well, I think either this gets a bigger redesign in 3.0, or we try to get it into 2.5 and accept some API changes. I think I lean towards a bolder breaking change to fix it up in 3.0, unless there's a pressing need for this metric.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r218209825
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala ---
    @@ -71,6 +71,23 @@ object Gini extends Impurity {
       @Since("1.1.0")
       def instance: this.type = this
     
    +  /**
    +   * :: DeveloperApi ::
    +   * p-values for test-statistic measures, unsupported for [[Gini]]
    +   */
    +  @Since("2.2.0")
    +  @DeveloperApi
    +  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double =
    --- End diff --
    
    It looks like this new method doesn't make sense to implement for existing implementations, only the new one. That kind of suggests to me it isn't part of the generic API for an impurity. Is this really something that belongs inside the logic of the implementations? maybe there's a more general method that needs to be exposed, that can then be specialized for all implementations.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #73008 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73008/testReport)** for PR 13440 at commit [`61cbf7c`](https://github.com/apache/spark/commit/61cbf7c5aafe4cc2c59daf0b058e701a8545f4eb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    I agree with @felixcheung -- @srowen or @thunterdb, can you take a look at this?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #75782 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75782/testReport)** for PR 13440 at commit [`6762a18`](https://github.com/apache/spark/commit/6762a18dd558b61d5b292787115d6e8c8768ed12).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r112719021
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala ---
    @@ -0,0 +1,163 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.impurity
    +
    +import org.apache.spark.annotation.{DeveloperApi, Experimental, Since}
    +
    +/**
    + * :: Experimental ::
    + * Class for calculating [[https://en.wikipedia.org/wiki/Chi-squared_test chi-squared]]
    --- End diff --
    
    @HyukjinKwon thank you!   I didn't notice that was in the full console output


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #76041 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76041/testReport)** for PR 13440 at commit [`d2a2381`](https://github.com/apache/spark/commit/d2a2381edc7d2b2c395f0e0049e152190cd15827).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r218208245
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala ---
    @@ -0,0 +1,162 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.impurity
    +
    +import org.apache.spark.annotation.{DeveloperApi, Experimental, Since}
    +
    +/**
    + * :: Experimental ::
    + * Class for calculating Chi Squared as a split quality metric during binary classification.
    + */
    +@Since("2.2.0")
    --- End diff --
    
    This will have to be 2.5.0 for the moment


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    @thunterdb still can't diagnose what the source of this "fails to generate doc" error is.  I don't see anything wrong with the scaladoc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59740/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r218209381
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
    @@ -52,6 +52,49 @@ trait Impurity extends Serializable {
       @Since("1.0.0")
       @DeveloperApi
       def calculate(count: Double, sum: Double, sumSquares: Double): Double
    +
    +  /**
    +   * :: DeveloperApi ::
    +   * Compute a test-statistic p-value quality measure from left and right split populations
    +   * @param calcL impurity calculator for the left split population
    +   * @param calcR impurity calculator for the right split population
    +   * @return The p-value for the null hypothesis; that left and right split populations
    +   * represent the same distribution
    +   * @note Unless overridden this method will fail with an exception, for backward compatability
    +   */
    +  @Since("2.2.0")
    +  @DeveloperApi
    +  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double
    +
    +  /**
    +   * :: DeveloperApi ::
    +   * Determine if this impurity measure is a test-statistic measure
    +   * @return True if this is a split quality measure based on a test statistic (i.e. returns a
    +   * p-value) or false otherwise.
    +   * @note Unless overridden this method returns false by default, for backward compatability
    +   */
    +  @Since("2.2.0")
    +  @DeveloperApi
    +  def isTestStatistic: Boolean
    +}
    +
    +/**
    + * :: DeveloperApi ::
    + * Utility functions for Impurity measures
    + */
    +@Since("2.0.0")
    +@DeveloperApi
    +object Impurity {
    +  /**
    +   * :: DeveloperApi ::
    +   * Convert a test-statistic p-value into a "larger-is-better" gain value.
    +   * @param pval The test statistic p-value
    +   * @return The negative logarithm of the p-value.  Any p-values smaller than 10^-20 are clipped
    +   * to 10^-20 to prevent arithmetic errors
    +   */
    +  @Since("2.0.0")
    +  @DeveloperApi
    +  def pValToGain(pval: Double): Double = -math.log(math.max(1e-20, pval))
    --- End diff --
    
    private to spark?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by thunterdb <gi...@git.apache.org>.
Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r104812484
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
    @@ -50,6 +50,50 @@ trait Impurity extends Serializable {
       @Since("1.0.0")
       @DeveloperApi
       def calculate(count: Double, sum: Double, sumSquares: Double): Double
    +
    +  /**
    +   * :: DeveloperApi ::
    +   * Compute a test-statistic p-value quality measure from left and right split populations
    +   * @param calcL impurity calculator for the left split population
    +   * @param calcR impurity calculator for the right split population
    +   * @return The p-value for the null hypothesis; that left and right split populations
    +   * represent the same distribution
    +   * @note Unless overridden this method will fail with an exception, for backward compatability
    +   */
    +  @Since("2.0.0")
    +  @DeveloperApi
    +  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double =
    +    throw new UnsupportedOperationException("Impurity.calculate")
    +
    +  /**
    +   * :: DeveloperApi ::
    +   * Determine if this impurity measure is a test-statistic measure
    +   * @return True if this is a split quality measure based on a test statistic (i.e. returns a
    +   * p-value) or false otherwise.
    +   * @note Unless overridden this method returns false by default, for backward compatability
    +   */
    +  @Since("2.0.0")
    +  @DeveloperApi
    +  def isTestStatistic: Boolean = false
    --- End diff --
    
    scala: do not add a default implementation, it causes issues with java compatibility


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r218245862
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala ---
    @@ -0,0 +1,162 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.impurity
    +
    +import org.apache.spark.annotation.{DeveloperApi, Experimental, Since}
    +
    +/**
    + * :: Experimental ::
    + * Class for calculating Chi Squared as a split quality metric during binary classification.
    + */
    +@Since("2.2.0")
    --- End diff --
    
    I'll update those.  3.0 might be a good target, especially if I can't do this without new `isTestStatistic`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r112596837
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala ---
    @@ -0,0 +1,163 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.impurity
    +
    +import org.apache.spark.annotation.{DeveloperApi, Experimental, Since}
    +
    +/**
    + * :: Experimental ::
    + * Class for calculating [[https://en.wikipedia.org/wiki/Chi-squared_test chi-squared]]
    --- End diff --
    
    The documentation generation is being failed due to this line - 
    
    ```
    [error] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/target/java/org/apache/spark/mllib/tree/impurity/ChiSquared.java:4: error: unexpected text
    [error]  * Class for calculating {@link https://en.wikipedia.org/wiki/Chi-squared_test chi-squared}
    [error]
    ```
    
    Probably, remove the link or wrap it `href` as I did before - https://github.com/apache/spark/pull/16013
    
    The other errors seem spurious. Please refer my observation - https://github.com/apache/spark/pull/17389#issuecomment-288438704


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #73008 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73008/testReport)** for PR 13440 at commit [`61cbf7c`](https://github.com/apache/spark/commit/61cbf7c5aafe4cc2c59daf0b058e701a8545f4eb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #75868 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75868/testReport)** for PR 13440 at commit [`a75a01b`](https://github.com/apache/spark/commit/a75a01b34cc0288b77309ed83a20569b86cb69db).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Is this something your still working on? If so it would be good to merge in the latest master. We can also check with @jkbradley to see if he has some review bandwidth.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73008/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75836/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    @thunterdb apologize for the latency.  I removed the default method defs and rebased


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    **[Test build #66766 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66766/consoleFull)** for PR 13440 at commit [`83f5e83`](https://github.com/apache/spark/commit/83f5e83fb87407bdd7dc8d740fba6fb30d1da3aa).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org