Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/10/21 23:01:33 UTC

[GitHub] [spark] atronchi opened a new pull request #26197: Implement p-value simulation and unit tests for chi2 test

URL: https://github.com/apache/spark/pull/26197
 
 
   ### What changes were proposed in this pull request?
   This PR implements Monte Carlo simulation of p-values for the `ChiSqTest` in MLlib. For other implementations and background, see the following references:
   * https://www.rdocumentation.org/packages/stats/versions/3.6.1/topics/chisq.test
   * https://en.wikipedia.org/wiki/Generalized_p-value
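
To make the idea concrete, here is a minimal single-machine sketch of a simulated p-value for a goodness-of-fit chi-squared test. It is not the code in this PR: the function names (`chi2_statistic`, `simulate_p_value`) and the `num_draws` and `seed` parameters are hypothetical, and it assumes the add-one estimator `(1 + exceed) / (num_draws + 1)` that R's `chisq.test` uses when `simulate.p.value = TRUE`.

```python
import random
from collections import Counter


def chi2_statistic(observed, expected):
    """Pearson chi-squared statistic: sum((O - E)^2 / E)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def simulate_p_value(observed, probs, num_draws=2000, seed=0):
    """Estimate a p-value by drawing multinomial samples under the null.

    Hypothetical helper for illustration only; not a Spark API.
    """
    rng = random.Random(seed)
    n = sum(observed)
    expected = [n * p for p in probs]
    stat = chi2_statistic(observed, expected)
    categories = range(len(probs))

    exceed = 0
    for _ in range(num_draws):
        # One draw: n observations assigned to categories with the null probabilities.
        counts = Counter(rng.choices(categories, weights=probs, k=n))
        simulated = [counts.get(i, 0) for i in categories]
        if chi2_statistic(simulated, expected) >= stat:
            exceed += 1

    # Add-one estimator, so the estimated p-value is never exactly zero.
    return stat, (1 + exceed) / (num_draws + 1)


uniform = [0.25, 0.25, 0.25, 0.25]
stat_ok, p_ok = simulate_p_value([25, 25, 25, 25], uniform)
stat_bad, p_bad = simulate_p_value([90, 5, 3, 2], uniform)
print(stat_ok, p_ok)
print(stat_bad, p_bad)
```

Observed counts that match the null exactly give a statistic of zero, so every simulated draw is at least as extreme and the estimated p-value is 1. Strongly skewed counts should give a p-value near the floor of `1 / (num_draws + 1)`.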
   
   ### Why are the changes needed?
   Monte Carlo simulation is a common way to estimate p-values, but a robust, scalable implementation in Spark is non-trivial. We hope others can reuse this work.
   
   ### Does this PR introduce any user-facing change?
   Yes. `ChiSqTest` gains a new boolean parameter, `simulatePValue`, that lets users request p-value simulation, and an integer parameter, `numDraw`, that sets the number of draws to take. The `getChi2Digest` method is also exposed in case users find value in the digest object itself, which allows extraction of arbitrary quantiles, the CDF, etc.
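
As a rough illustration of why a digest of simulated statistics is useful on its own, the sketch below stores the simulated values in a sorted list and answers quantile, CDF, and p-value queries from it. This is an exact empirical distribution, not an approximate streaming digest, and it does not reflect the object returned by `getChi2Digest`. The class name `EmpiricalDigest` and its methods are hypothetical.

```python
import bisect


class EmpiricalDigest:
    """Exact empirical distribution over a finite set of simulated statistics.

    Hypothetical stand-in for a digest object; illustration only.
    """

    def __init__(self, values):
        self._sorted = sorted(values)
        if not self._sorted:
            raise ValueError("at least one value is required")

    def cdf(self, x):
        """Fraction of stored values that are <= x."""
        return bisect.bisect_right(self._sorted, x) / len(self._sorted)

    def quantile(self, q):
        """Stored value at rank floor(q * n), clamped to the last element."""
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be in [0, 1]")
        index = min(int(q * len(self._sorted)), len(self._sorted) - 1)
        return self._sorted[index]

    def p_value(self, statistic):
        """Fraction of stored values that are >= statistic."""
        n = len(self._sorted)
        return (n - bisect.bisect_left(self._sorted, statistic)) / n


digest = EmpiricalDigest([4.0, 1.0, 3.0, 2.0])
print(digest.cdf(2.0), digest.quantile(0.5), digest.p_value(3.0))
```

Once the simulated statistics are summarized this way, the same object can serve a p-value for any observed statistic as well as critical values at any significance level, without re-running the simulation.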
   
   ### How was this patch tested?
   This PR also implements `ChiSqTestSuite`, whose tests verify both the ChiSqTest itself and the new p-value simulation by checking that cases expected to pass a chi-squared test do pass and that cases expected to fail do fail.
   
   We ran these tests with the following results: 
   ```
   $ build/mvn package -pl mllib -Dtest=none -DwildcardSuites=org.apache.spark.mllib.stat.test.ChiSqTestSuite
   ...
   ChiSqTestSuite:
   - theoretical chi2 test
   - simulated/empirical chi2 test
   Run completed in 1 minute, 22 seconds.
   Total number of tests run: 2
   Suites: completed 2, aborted 0
   Tests: succeeded 2, failed 0, canceled 0, ignored 0, pending 0
   All tests passed.
   ...
   [INFO] BUILD SUCCESS
   ```
