You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by "Alex Herbert (Jira)" <ji...@apache.org> on 2023/02/16 13:17:00 UTC

[jira] [Resolved] (STATISTICS-65) Reimplement the Kolmogorov-Smirnov Test

     [ https://issues.apache.org/jira/browse/STATISTICS-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Herbert resolved STATISTICS-65.
------------------------------------
    Resolution: Implemented

The updated KS test has been implemented in the port of the inference module (see STATISTICS-62)

This provides a result for the two-sample test that contains:
 * The D statistic
 * P-value for the D statistic
 * Flag indicating if there were ties in the data that could result in a different D statistic depending on the ordering of tied data
 * The upper D statistic (most extreme D possible from all possible resolutions of ties)
 * The p-value for the upper D statistic

This is the default behaviour. Computation of the upper bound on D is a negligible cost. Computation of the upper p-value does add a performance cost. This could be refactored to compute dynamically if required subject to performance testing (as computation of D may be a significant part of the entire cost of the function; currently the relative cost of computing the p-value is unknown).

An option is provided in the test to compute the p-value using an ESTIMATE method. Currently this only support the sampling method: N iterations are used to sample from the joint distribution of the two samples, each sample has the D value computed. The p-value is the fraction of D that was more extreme than the observed test statistic. Significant digits in the p-value are thus upper bounded by log10(iterations).

In this case the p-value for the upper D statistic is not computed. To compute it using the sampling method (from the same samples) would require modification of the short-circuit condition used on the D statistic and degrade performance. Note that comparison of the p-values of D and upper D are less relevant for estimates than for the exact p computation as the results are only one possible result from random sampling.

An option to sample/enumerate paths through the tied regions has not been added. This would provide a sample for, or the entire set of, all possible D and could thus provide an average p-value from the distribution of D. This is equivalent to performing a KS test on the data multiple times and resolving ties differently each time. The resulting p-value may result in either a type I (false positive) or type II (false negative) error. The current functionality to provide p-values for the extreme range of D provides a bound on the p-value to minimise type I or type II errors. It would be a lot more computation to create an average p-value with a questionable benefit for hypothesis testing. If the D statistic and upper D have very different p-values then the data has a large number of ties and the KS test is not suitable for the data. Thus this functionality has been omitted and can be added later if required.

 

> Reimplement the Kolmogorov-Smirnov Test
> ---------------------------------------
>
>                 Key: STATISTICS-65
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-65
>             Project: Commons Statistics
>          Issue Type: New Feature
>          Components: inference
>            Reporter: Alex Herbert
>            Assignee: Alex Herbert
>            Priority: Major
>             Fix For: 1.1
>
>
> The inference module contains the Kolmogorov-Smirnov (KS) test. The two-sample test computes the D statistic as the supremum (effectively the maximum) difference between the empirical CDF of two samples:
>  
> {noformat}
> D  = sup | F(x) - G(x) |  = max(D+, D-)
> D+ = sup   F(x) - G(x)
> D- = sup   G(x) - F(x){noformat}
> Two-sided tests use D, one sided use D+ (greater than) or D- (less than).
>  
> The implementation in CM contains computation of the D value and p-values for D. Computation of D involves sorting the combined dataset of the two-samples (x_n and y_m) and moving D down 1/n if x is observed, otherwise moving up by 1/m if y is observed. The resulting D is the maximum deviation from 0, with the result in [0, 1]. (D+ is above 0; D- is below 0). This procedure does not work in the presence of ties as you do not know whether to adjust up or down, so the entire tied region is skipped and the cumulative effect on D is used.
> The computation of the p-value is adjusted based on the presence of ties in the data. Currently if ties are present the exact p-value is noted to be invalid. The method then randomly adjusts the input values to eliminate ties so that a D value can be created and the exact p-value computed for the random sample D of all possible D. This behaviour occurs without a user choice. Adjustment of data uses small steps (a few ULP) in order to minimise any change to the ranking of x and y. But depending on the data a change to the ranking can occur, and it can occur outside the tied region as all data is subject to random jitter, not just the tied values. Effectively the method is randomly adjusting the input data and hoping that it will not change non-tied regions just to generate a random path through the tied region and a single sample for D.
> Note the following observations:
>  * Ties do not make the p-value computation incorrect; they make the D value computation a single sample of possible D values. The p-value computation is still exact but it is using an input D with a value that is potentially lower than it could be, depending on how ties are resolved.
>  * Ties within the same sample will not change the result
>  * Ties between the two samples *may* change the result
>  * The distribution of D is discrete
>  * The distribution of D in the absence of ties has PMF(D=0) = 0. If there are no ties you either go up or down. This will move away from 0 at least 1 step.
> What is missing from the implementation is a detection of whether the ties matter. Any tied region can be traversed a number of different ways. If none of these possible combinations change the D value then the tied region is located where it does not matter, e.g.
> {noformat}
> [             4, 5, 6, 7, 8]
> [ 0, 1, 2, 3,       6      ]{noformat}
> This square data n=m takes 4x, 2y before the single tie then 2y. If the tie resolves as (1x, 1y) or (1y, 1x) then the maximum D is unchanged.
> A region of a and b ties for x and y will have binom(a+b, b) possible combinations. However the largest change to the D value is the two most extreme paths (ax, by) or (by, ax). Since the entire tied region is traversed in a single step this is trivial to compute as the number of steps in x and y is known. The implementation must then only track the D value when tied regions are skipped (current behaviour), or the most extreme D value through any tied regions. If the resulting max D is the same then none of the tied regions has an effect on D then the D value is unique among all possible D and the p-value is exact. If it is possible to traverse the tied regions and create a more extreme D then the p-value is not exact.
> I suggest updating the implementation to:
>  * Compute D
>  * Compute the most extreme D value generated through tied regions
>  * Return a result that indicates if the p-value is an estimate. Optionally the most extreme D statistic can be returned.
> Note that the most extreme D value is informative if for example it is much larger than D. This will occur when the tied regions are long. The most extreme case is input of matching data where D=0 but the extreme D value is 1.
> In the case where the tied regions can change the D value there are at least two options:
>  # Use the bootstrap sampling method in the current CM implementation. This generates n random samples (sx_n and sy_m) from the combined distribution of x and y and D values for each. The p-value is the number of times the sample D value for sx_n and sy_m was larger than the D for x and y. This only applies if the combined sample is representative of the shared distribution so is for large sample sizes.
>  # Walk n random paths through tied regions to create a distribution for D. Generate the average p value for all D in the sample. Note that if the number of the paths though the tied regions is small it is possible to enumerate them. The combinations of binom(a+b, b) can be efficiently enumerated using the Combinations class from the Numbers combinatorics module.
> The option to generate an estimate for P using a sampled distribution for D should not be performed in the main KS test routine.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)