Posted to issues@commons.apache.org by "Phil Steitz (JIRA)" <ji...@apache.org> on 2015/07/07 05:59:04 UTC

[jira] [Comment Edited] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties

    [ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616134#comment-14616134 ] 

Phil Steitz edited comment on MATH-1246 at 7/7/15 3:59 AM:
-----------------------------------------------------------

I think the current implementation can be fixed as follows.  If we move to a faster implementation, the strategy below may not work.

What exactP does now is exhaustively compute the D-statistic for every m-set / n-set partition of m+n and tally the number of partitions whose D exceeds (strict) or is at least as large as (not strict) the observed D.  If there are ties in the data, it is not correct to look at partitions of m+n, since not all partitions of an m+n set with duplicates are distinct and the set of possible D values is different in the presence of ties.  I think we can correctly handle ties in the data if we compute and tally D statistics based on a combined multiset sample with duplicates in the positions corresponding to what is observed in the data.  For example, suppose that the two samples are x = [0, 3, 6, 9, 9, 10] and y = [1, 3, 4, 8, 11].  Then the multiset universe is U = [0, 1, 3, 3, 4, 6, 8, 9, 9, 10, 11].  As before, we generate partitions of 11 into a 6-set and a 5-set, but instead of computing the D-statistics on the subsets of 11 themselves, we treat their elements as (0-based) indexes into U.  So if a generated split is mSet = [0, 2, 3, 7, 8, 9], nSet = [1, 4, 5, 6, 10], we compute D for [0, 3, 3, 9, 9, 10] and [1, 4, 6, 8, 11].  The rationale here is that the p-value is the probability that, if U is split randomly into a 6-set and a 5-set, the resulting D-value exceeds (or, in the non-strict case, is at least as large as) the observed D.
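
To make the proposal concrete, here is a rough Java sketch of the index-into-U enumeration described above.  It is not the Commons Math code: the class name TiesExactPSketch and the methods dTimesMN, exactP and enumerate are hypothetical names made up for this illustration.  One choice that is mine and not part of the proposal: D is carried as the integer m*n*D so that the comparison against the observed value is exact.  Like the current approach, it visits all C(m+n, m) splits, so it is only practical for small samples.

```java
import java.util.Arrays;

/** Illustrative sketch only; names are hypothetical, not the Commons Math API. */
final class TiesExactPSketch {

    /** Returns m*n*D, where D = sup_t |F_x(t) - F_y(t)|, as an exact integer. */
    static long dTimesMN(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int m = sx.length, n = sy.length, i = 0, j = 0;
        long sup = 0;
        while (i < m || j < n) {
            // Next distinct value t in the combined sample.
            double t = (j >= n || (i < m && sx[i] <= sy[j])) ? sx[i] : sy[j];
            // Step past every copy of t in BOTH samples before comparing,
            // so tied values never create a spurious gap between the ECDFs.
            while (i < m && sx[i] == t) i++;
            while (j < n && sy[j] == t) j++;
            sup = Math.max(sup, Math.abs((long) i * n - (long) j * m));
        }
        return sup;
    }

    /** Exact p-value over all index splits of the combined multiset U. */
    static double exactP(double[] x, double[] y, boolean strict) {
        int m = x.length, n = y.length;
        double[] u = new double[m + n];
        System.arraycopy(x, 0, u, 0, m);
        System.arraycopy(y, 0, u, m, n);
        Arrays.sort(u); // U keeps duplicates, e.g. [0, 1, 3, 3, 4, ...]
        long observed = dTimesMN(x, y);
        long[] counts = new long[2]; // [0] = splits in the tail, [1] = all splits
        enumerate(u, new int[m], 0, 0, observed, strict, counts);
        return (double) counts[0] / counts[1];
    }

    /** Recursively generates every increasing m-subset of indexes into u. */
    private static void enumerate(double[] u, int[] mSet, int start, int depth,
                                  long observed, boolean strict, long[] counts) {
        int m = mSet.length;
        if (depth == m) {
            double[] a = new double[m];
            double[] b = new double[u.length - m];
            int ai = 0, bi = 0;
            for (int k = 0; k < u.length; k++) {
                if (ai < m && mSet[ai] == k) {
                    a[ai++] = u[k]; // index k belongs to mSet
                } else {
                    b[bi++] = u[k]; // otherwise it belongs to nSet
                }
            }
            long d = dTimesMN(a, b);
            if (strict ? d > observed : d >= observed) {
                counts[0]++;
            }
            counts[1]++;
            return;
        }
        // Leave enough indexes to fill the remaining (m - depth) slots.
        for (int k = start; k <= u.length - (m - depth); k++) {
            mSet[depth] = k;
            enumerate(u, mSet, k + 1, depth + 1, observed, strict, counts);
        }
    }
}
```

With the samples from the example, the call would be TiesExactPSketch.exactP(x, y, false) for the non-strict p-value; each generated mSet / nSet is mapped through U exactly as in the split shown above.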


was (Author: psteitz):
I think the current implementation can be fixed as follows.  If we move to a faster implementation, the strategy below may not work.

What exactP does now is to exhaustively compute all possible D-statistics for all m-set / n-set partitions of m+n and simply tally the number that exceed (strict) or are as large as (not strict) the observed D.  If there are ties in the data, it is not correct to look at partitions of m+n, since not all partitions of an m+n set with duplicates are distinct and the set of possible D values is different in the presence of ties.  I think we can correctly handle ties in the data if we compute and tally D statistics based on a combined multi-set sample with duplicates in the positions corresponding to what is observed in the data.  For example, suppose that the two samples are x = [0, 3, 6, 9, 9, 10] and y = [1, 3, 4, 8, 11].  Then the multi-set universe is U = {0, 1, 3, 3, 4, 6, 8, 9, 9, 10, 11}.  As before, we generate partitions of 11 into a 6-set and a 5-set, but instead of computing the D-statistics on the subsets of 11, we use indexes into U instead.  So if a generated split is mSet = {0, 2, 3, 7, 8, 9}, nSet = {1, 4, 5, 6, 10}, we compute D for [0, 3, 3, 9, 9, 10] and [1, 4, 6, 8, 11].  The rationale here is that the p-value is the probability that if U is split randomly into a 5-set and a 6-set, the D-value exceeds the observed d.

> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
>                 Key: MATH-1246
>                 URL: https://issues.apache.org/jira/browse/MATH-1246
>             Project: Commons Math
>          Issue Type: Bug
>            Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the distribution of a D-statistic for m-n sets with no ties.  When ties are present, it issues no warning and applies no special handling.


