Posted to issues@commons.apache.org by "Alex Herbert (Jira)" <ji...@apache.org> on 2023/02/03 15:25:00 UTC

[jira] [Commented] (STATISTICS-62) Port o.a.c.math.stat.inference to a commons-statistics-inference module

    [ https://issues.apache.org/jira/browse/STATISTICS-62?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683918#comment-17683918 ] 

Alex Herbert commented on STATISTICS-62:
----------------------------------------

I have finished the work on the computations performed by the new inference module. This has led to development of the API based on the currently supported options.

The first observation is that all tests are separated into three steps: creating a statistic; creating a p-value for the statistic; and computing a boolean value to reject the null hypothesis given a significance level. The last step is trivially:
{code:java}
return p < alpha;
{code}
It is extreme code bloat to duplicate methods just to pass a significance level and perform this boolean expression. Also note that if you require a p-value then you also have to have a statistic, so these should be paired in a result (statistic, p-value).

I have written each test to have the following generic API where methods have compulsory arguments and optional ones. The syntax below is akin to a language that supports optional named arguments:
{code:java}
double statistic(x, y, option1=a)
SignificanceResult test(x, y, option1=a, option2=b, option3=c){code}
The test result is:
{code:java}
public interface SignificanceResult {
    double getStatistic();
    double getPValue();
    default boolean reject(double alpha) {
        // validate alpha in (0, 0.5], then
        return getPValue() < alpha;
    }
} {code}
Tests may return more information by extending the SignificanceResult. This is actually useful for some tests which have a lot more information, for example the OneWayAnova test can return all data typically reported for ANOVA tests (degrees of freedom between and within groups, variances between and within groups).
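As an illustration only, an extended result could be sketched as below. The names AnovaResult and SimpleAnovaResult and their accessors are hypothetical, as is the choice of IllegalArgumentException for an invalid alpha; only SignificanceResult and its three methods come from the description above.

```java
/** Result pairing a statistic with its p-value, as described in the comment. */
interface SignificanceResult {
    double getStatistic();

    double getPValue();

    /** Rejects the null hypothesis when the p-value is below alpha. */
    default boolean reject(double alpha) {
        // alpha must be in (0, 0.5]; the exception type is an assumption.
        if (!(alpha > 0 && alpha <= 0.5)) {
            throw new IllegalArgumentException("Significance level not in (0, 0.5]: " + alpha);
        }
        return getPValue() < alpha;
    }
}

/** Hypothetical extension carrying the extra quantities reported for ANOVA. */
interface AnovaResult extends SignificanceResult {
    int getDegreesOfFreedomBetweenGroups();

    int getDegreesOfFreedomWithinGroups();

    double getVarianceBetweenGroups();

    double getVarianceWithinGroups();
}

/** Hypothetical immutable holder; values are supplied by the caller, nothing is computed here. */
final class SimpleAnovaResult implements AnovaResult {
    private final double statistic;
    private final double pValue;
    private final int dfBetween;
    private final int dfWithin;
    private final double varBetween;
    private final double varWithin;

    SimpleAnovaResult(double statistic, double pValue, int dfBetween, int dfWithin,
                      double varBetween, double varWithin) {
        this.statistic = statistic;
        this.pValue = pValue;
        this.dfBetween = dfBetween;
        this.dfWithin = dfWithin;
        this.varBetween = varBetween;
        this.varWithin = varWithin;
    }

    @Override public double getStatistic() { return statistic; }
    @Override public double getPValue() { return pValue; }
    @Override public int getDegreesOfFreedomBetweenGroups() { return dfBetween; }
    @Override public int getDegreesOfFreedomWithinGroups() { return dfWithin; }
    @Override public double getVarianceBetweenGroups() { return varBetween; }
    @Override public double getVarianceWithinGroups() { return varWithin; }
}
```

Code that only needs the decision can keep working against SignificanceResult, while callers that want the ANOVA table can use the narrower type.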

Note that the statistic method is seemingly redundant as you can call test and extract the statistic from the result. However, the use case is comparing a statistic against a pre-computed critical value (e.g. from a table of critical values), where the computational effort of generating the p-value is not required. An extreme example: each build of the Commons RNG core module performs approximately 17*500 chi-square tests for uniformity per RNG implementation (50 instances are currently tested), which is at least 425,000 tests per build, all using the same critical value. A critical value is used in other places too, so this is an underestimate.
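The critical-value use case can be sketched as follows. This is a stand-alone illustration, not the module's code: the class CriticalValueDemo and its methods are hypothetical, and the statistic is the textbook Pearson chi-square against a uniform expectation.

```java
/** Hypothetical demo: compare a chi-square statistic with a pre-computed critical value. */
final class CriticalValueDemo {
    private CriticalValueDemo() {}

    /** Pearson chi-square statistic of the observed counts against a uniform expectation. */
    static double chiSquareUniform(long[] observed) {
        long total = 0;
        for (long o : observed) {
            total += o;
        }
        // Under uniformity every category has the same expected count.
        double expected = (double) total / observed.length;
        double sum = 0;
        for (long o : observed) {
            double d = o - expected;
            sum += d * d / expected;
        }
        return sum;
    }

    /** Rejects uniformity when the statistic exceeds the critical value; no p-value is computed. */
    static boolean rejectUniform(long[] observed, double criticalValue) {
        return chiSquareUniform(observed) > criticalValue;
    }
}
```

With four categories (3 degrees of freedom) the tabulated upper 5% critical value is approximately 7.815, so a repeated test reduces to rejectUniform(counts, 7.815) with the constant looked up once.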

Also note that this removes the ability to compute a p-value given a statistic. However this is functionality that belongs in the Statistics distribution package. The only distributions not there that are required are the distributions for the Kolmogorov-Smirnov, Mann-Whitney U and the Wilcoxon signed rank statistic. Since these only require the p-value from the survival function, the implementations are partial and are missing the CDF, PDF and moments needed for inclusion in the distribution package. The implementations could be ported there if a full implementation is completed. I am not aware of the usefulness of these distributions outside of inference testing.

Since Java does not support optional arguments there are a few ways to implement the API. Options can be strongly typed as immutable objects with properties. The Java example below shows this using a builder pattern for the Kolmogorov-Smirnov test. For reference, this is SciPy's test signature, to which I have added the ability to compute the p-value with a strict inequality (an option carried over from the CM implementation):
{noformat}
scipy.stats.ks_2samp(data1, data2, alternative='two-sided', method='auto', strict=False){noformat}
Java with Options:
{code:java}
public final class KolmogorovSmirnovTest {
    public static class Options {
        public static class Builder {
            public Builder setAlternative(AlternativeHypothesis v);
            public Builder setPValueMethod(PValueMethod v);
            public Builder setStrictInequality(boolean v);
            public Options build();
        }
        public static Options defaults();
        public static Builder builder();
        public Builder toBuilder();
        public AlternativeHypothesis getAlternative();
        public PValueMethod getPValueMethod();
        public boolean isStrictInequality();
    }
    public static double statistic(double[] x, double[] y,
                                   AlternativeHypothesis alternative);
    public static SignificanceResult test(double[] x, double[] y) {
        return test(x, y, Options.defaults());
    }
    public static SignificanceResult test(double[] x, double[] y, Options options);
} {code}
Calling it with the defaults is simple; calling it with any other options is quite verbose:
{code:java}
double[] x, y;
SignificanceResult r1 = KolmogorovSmirnovTest.test(x, y);
SignificanceResult r2 = KolmogorovSmirnovTest.test(x, y,
    Options.builder().setAlternative(AlternativeHypothesis.GREATER_THAN)
                     .setPValueMethod(PValueMethod.EXACT)
                     .setStrictInequality(true)
                     .build());{code}
Note that for repeat testing the options can be pre-built and passed in.

A simpler API without the bloat of strongly typed options (with some way to build them) is to have optional arguments as a varargs array:
{code:java}
public final class KolmogorovSmirnovTest {
    public static double statistic(double[] x, double[] y,
                                   AlternativeHypothesis alternative);
    public static SignificanceResult test(double[] x, double[] y, Object... options);
}  {code}
Calling it then becomes:
{code:java}
double[] x, y;
SignificanceResult r1 = KolmogorovSmirnovTest.test(x, y);
SignificanceResult r2 = KolmogorovSmirnovTest.test(x, y,
    AlternativeHypothesis.GREATER_THAN,
    PValueMethod.EXACT,
    Inequality.STRICT); {code}
Here the Object[] must be parsed by the test method to extract any options it recognises. This is similar to the Optimizer API in CM4 (see [BaseOptimizer.optimize|https://commons.apache.org/proper/commons-math/javadocs/api-4.0-beta1/org/apache/commons/math4/legacy/optim/BaseOptimizer.html#optimize(org.apache.commons.math4.legacy.optim.OptimizationData...)]), but does not require all options to implement a marker interface. With a marker interface the signature would be, e.g.:
{code:java}
public final class KolmogorovSmirnovTest {     
    // ...
    public static SignificanceResult test(double[] x, double[] y, TestOption... options);
} {code}
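A minimal sketch of how a test method might parse an Object... array is given below. It is an assumption about one possible implementation, not the actual code: the class ParsedOptions is hypothetical, as are the enum constants other than GREATER_THAN, EXACT and STRICT, and rejecting an unrecognised option (rather than ignoring it) is my choice for the illustration.

```java
// Enum names follow the calling example; constants not used there are assumed.
enum AlternativeHypothesis { TWO_SIDED, GREATER_THAN, LESS_THAN }

enum PValueMethod { AUTO, EXACT, ASYMPTOTIC }

enum Inequality { NON_STRICT, STRICT }

/** Hypothetical holder for the options recognised by the two-sample test. */
final class ParsedOptions {
    // Defaults mirror alternative=two-sided, method=auto, strict=false.
    AlternativeHypothesis alternative = AlternativeHypothesis.TWO_SIDED;
    PValueMethod method = PValueMethod.AUTO;
    Inequality inequality = Inequality.NON_STRICT;

    /** Extracts recognised options by type; the order of the arguments does not matter. */
    static ParsedOptions parse(Object... options) {
        ParsedOptions parsed = new ParsedOptions();
        for (Object o : options) {
            if (o instanceof AlternativeHypothesis) {
                parsed.alternative = (AlternativeHypothesis) o;
            } else if (o instanceof PValueMethod) {
                parsed.method = (PValueMethod) o;
            } else if (o instanceof Inequality) {
                parsed.inequality = (Inequality) o;
            } else {
                // Assumption: fail fast instead of silently ignoring unknown options.
                throw new IllegalArgumentException("Unrecognised option: " + o);
            }
        }
        return parsed;
    }
}
```

Because each option is identified by its type, adding a new option only adds a branch to the parser and leaves the method signature unchanged.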
When using varargs any primitive values must be wrapped with a class that can be uniquely identified. Hence the API for the chi-square test with an optional degrees of freedom adjustment is called using:
{code:java}
public final class ChiSquareTest {
    // ...
    public static SignificanceResult test(double[] expected, long[] observed, Object... options);
}

ChiSquareTest.test(expected, observed, DegreesOfFreedomAdjustment.of(1));{code}
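The wrapper for a primitive option could be sketched as below. Only the name DegreesOfFreedomAdjustment and its of(int) factory appear above; the accessor getAsInt, the helper class ChiSquareOptions, the non-negative check and the formula df = categories - 1 - adjustment are assumptions made for the illustration.

```java
/** Wraps the int so it can be identified by type inside an Object... array. */
final class DegreesOfFreedomAdjustment {
    private final int value;

    private DegreesOfFreedomAdjustment(int value) {
        this.value = value;
    }

    static DegreesOfFreedomAdjustment of(int value) {
        // Assumption: a negative adjustment is not meaningful.
        if (value < 0) {
            throw new IllegalArgumentException("Negative adjustment: " + value);
        }
        return new DegreesOfFreedomAdjustment(value);
    }

    int getAsInt() {
        return value;
    }
}

/** Hypothetical helper showing how a test might consume the wrapped option. */
final class ChiSquareOptions {
    private ChiSquareOptions() {}

    /** Returns the adjustment found in the options, or the default of 0 if absent. */
    static int adjustment(Object... options) {
        int adjustment = 0;
        for (Object o : options) {
            if (o instanceof DegreesOfFreedomAdjustment) {
                adjustment = ((DegreesOfFreedomAdjustment) o).getAsInt();
            }
        }
        return adjustment;
    }

    /** Assumed formula: df = categories - 1 - adjustment. */
    static int degreesOfFreedom(int categories, Object... options) {
        int df = categories - 1 - adjustment(options);
        if (df <= 0) {
            throw new IllegalArgumentException("Invalid degrees of freedom: " + df);
        }
        return df;
    }
}
```

A bare Integer in the array would be ambiguous as soon as a second integer-valued option exists, which is why the value is given its own type.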
This highlights the issue where tests only have a single option. For consistency the API would specify the varargs. But for simplicity the method can be provided with the optional parameter as an overloaded method.

What I do not wish to happen is that the API is expanded over time with a daisy chain of overloaded methods as more options are added to existing tests. So to prevent this I would recommend some type of minimum API that naturally expands to accommodate additional options.

Currently the API consists of:
{noformat}
BinomialTest:
// statistic = numberOfSuccesses / numberOfTrials so is omitted from the API
test(int numberOfTrials, int numberOfSuccesses, double probability, alternative=two-sided)

ChiSquareTest:
statistic(long[] observed)
statistic(double[] expected, long[] observed)
statistic(long[][] counts)
statistic(long[] observed1, long[] observed2)
test(long[] observed, degreesOfFreedomAdjustment=0)
test(double[] expected, long[] observed, degreesOfFreedomAdjustment=0)
test(long[][] counts)
test(long[] observed1, long[] observed2)

GTest:
statistic(long[] observed)
statistic(double[] expected, long[] observed)
statistic(long[][] counts)
test(long[] observed, degreesOfFreedomAdjustment=0)
test(double[] expected, long[] observed, degreesOfFreedomAdjustment=0)
test(long[][] counts)

KolmogorovSmirnovTest:
statistic(double[] x, DoubleUnaryOperator cdf, alternative=two-sided)
statistic(double[] x, double[] y, alternative=two-sided)
test(double[] x, DoubleUnaryOperator cdf, alternative=two-sided, method=auto)
test(double[] x, double[] y, alternative=two-sided, method=auto, strict=false)
estimateP(double[] x, double[] y,
          UniformRandomProvider rng,
          int iterations,
          method=[sampling, random-walk],
          alternative=two-sided, strict=false)

MannWhitneyUTest:
statistic(double[] x, double[] y)
test(double[] x, double[] y, alternative=two-sided, method=auto, correct=true)

OneWayAnova:
// statistic is omitted as the statistic must be specified with degrees of freedom: (F, df_bg, df_wg)
test(Collection<double[]> data)

TTest:
statistic(m, v, n, mu=0)
statistic(double[] x, mu=0)
pairedStatistic(double[] x, double[] y, mu=0)
statistic(m1, v1, n1, m2, v2, n2, mu=0, homoscedastic=false)
statistic(double[] x, double[] y, mu=0, homoscedastic=false)
test(m, v, n, mu=0, alternative=two-sided)
test(double[] x, mu=0, alternative=two-sided)
pairedTest(double[] x, double[] y, mu=0, alternative=two-sided)
test(m1, v1, n1, m2, v2, n2, mu=0, homoscedastic=false, alternative=two-sided)
test(double[] x, double[] y, mu=0, homoscedastic=false, alternative=two-sided)

WilcoxonSignedRankTest:
statistic(double[] z)
statistic(double[] x, double[] y)
test(double[] z, alternative=two-sided, method=auto, correct=true)
test(double[] x, double[] y, alternative=two-sided, method=auto, correct=true){noformat}
Note that the paired TTest could be provided as an option for the two-sample test, i.e. paired or unpaired. This is the way it is implemented in R. In SciPy they provide a method for two-sample independent (scipy.stats.ttest_ind) and two-sample related (scipy.stats.ttest_rel).

The KolmogorovSmirnovTest has a method to estimate p-values. The CM implementation has two estimation methods requiring a random generator, and also functionality to remove ties in the data using randomness. I have changed the functionality but the details should be under a separate ticket. Here we will assume that the standard statistic and p-value computations are deterministic and any non-deterministic estimation is in a separate method, so the user is aware they are using randomness to generate the result. The API choice then becomes how to pass non-default parameters to the estimation method, e.g. those controlling the estimation procedure.

 

Currently I am favouring the test(x, y, Object... options) API to remove all the bloat of builders for Options. It allows more options to be added with no API changes. Any opinions on this would be welcome.

 

> Port o.a.c.math.stat.inference to a commons-statistics-inference module
> -----------------------------------------------------------------------
>
>                 Key: STATISTICS-62
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-62
>             Project: Commons Statistics
>          Issue Type: New Feature
>          Components: inference
>    Affects Versions: 1.0
>            Reporter: Alex Herbert
>            Priority: Major
>
> The o.a.c.math4.legacy.stat.inference package contains:
>  
> {noformat}
> AlternativeHypothesis.java
> BinomialTest.java
> ChiSquareTest.java
> GTest.java
> InferenceTestUtils.java
> KolmogorovSmirnovTest.java
> MannWhitneyUTest.java
> OneWayAnova.java
> TTest.java
> WilcoxonSignedRankTest.java{noformat}
> There are few dependencies on other math packages. The notable exceptions are:
>  
> 1. KolmogorovSmirnovTest which requires matrix support. This is for multiplication of a square matrix to support a matrix power function. This uses a double matrix and the same code is duplicated for a BigFraction matrix. Such code can be ported internally to support only the required functions. It can also drop the defensive copy strategy used by Commons Math in matrices to allow multiply in-place where appropriate for performance gains.
> 2. OneWayAnova which collates the sum, sum of squares and count using SummaryStatistics. This can be done using an internal class. It is possible to call the test method using already computed SummaryStatistics. The method that does this using the SummaryStatistics as part of the API can be dropped, or supported using an interface that returns: getSum, getSumOfSquares, getN.
> All the inference Test classes have instance methods but no state. The InferenceTestUtils is a static class that holds references to a singleton for each class and provides static methods to pass through to the underlying instances.
> I suggest changing the test classes to have only static methods and dropping InferenceTestUtils.
>  


