Posted to issues@commons.apache.org by "Phil Steitz (JIRA)" <ji...@apache.org> on 2015/11/09 22:15:10 UTC

[jira] [Commented] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties

    [ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997372#comment-14997372 ] 

Phil Steitz commented on MATH-1246:
-----------------------------------

I did some more extensive testing against R's ks.boot and found significant differences from the code in ce98d00852e21ce34d8d247db7f6be138967b559.  I have determined why the results differ and that my initial approach was incorrect.  The difference arises because ks.boot samples "with replacement" from the combined empirical distribution, while my approach constrains the n-m split to be one that can be achieved using the combined dataset.  I interpreted the p-value to be essentially the same as in the no-ties case: the probability that, when the combined set of values is split into an n-set and an m-set, the KS statistic is greater than or equal to what we observe in the data.  The theoretical development in [1] and the implementation in ks.boot define the p-value to be the probability that, when an m-set and an n-set are drawn independently from the combined empirical distribution, the KS statistic exceeds what we see in the data.  These are not the same, and when there are a lot of ties the estimates diverge.  Apologies for being a little dense on this.
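
To make the two p-value definitions above concrete, here is a minimal, self-contained sketch that estimates both by Monte Carlo. It is not the Commons Math code from the commit above, nor the ks.boot implementation. The class and method names (KsTiesSketch, permutationPValue, bootstrapPValue) are hypothetical, and the use of ">=" with a small tolerance in both estimators is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration class; not part of Commons Math.
class KsTiesSketch {

    /** Two-sample KS statistic D = sup |F_x(t) - F_y(t)|; tied values are stepped over together. */
    static double ksStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int i = 0;
        int j = 0;
        double d = 0;
        while (i < sx.length && j < sy.length) {
            double v = Math.min(sx[i], sy[j]);
            // Advance both empirical CDFs past every copy of v before comparing.
            while (i < sx.length && sx[i] == v) i++;
            while (j < sy.length && sy[j] == v) j++;
            d = Math.max(d, Math.abs((double) i / sx.length - (double) j / sy.length));
        }
        return d;
    }

    /** Combined sample: x followed by y. */
    static double[] pool(double[] x, double[] y) {
        double[] pooled = Arrays.copyOf(x, x.length + y.length);
        System.arraycopy(y, 0, pooled, x.length, y.length);
        return pooled;
    }

    /** First interpretation: random n-m splits of the combined sample (no replacement). */
    static double permutationPValue(double[] x, double[] y, int iterations, Random rng) {
        double observed = ksStatistic(x, y);
        double[] pooled = pool(x, y);
        int count = 0;
        for (int k = 0; k < iterations; k++) {
            // Fisher-Yates shuffle, then split into the first n and the remaining m values.
            for (int i = pooled.length - 1; i > 0; i--) {
                int j = rng.nextInt(i + 1);
                double t = pooled[i];
                pooled[i] = pooled[j];
                pooled[j] = t;
            }
            double[] a = Arrays.copyOfRange(pooled, 0, x.length);
            double[] b = Arrays.copyOfRange(pooled, x.length, pooled.length);
            // Assumption: count D >= observed, with a small tolerance.
            if (ksStatistic(a, b) >= observed - 1e-12) count++;
        }
        return (double) count / iterations;
    }

    /** Second interpretation: n-set and m-set drawn independently, with replacement. */
    static double bootstrapPValue(double[] x, double[] y, int iterations, Random rng) {
        double observed = ksStatistic(x, y);
        double[] pooled = pool(x, y);
        double[] a = new double[x.length];
        double[] b = new double[y.length];
        int count = 0;
        for (int k = 0; k < iterations; k++) {
            for (int i = 0; i < a.length; i++) a[i] = pooled[rng.nextInt(pooled.length)];
            for (int i = 0; i < b.length; i++) b[i] = pooled[rng.nextInt(pooled.length)];
            // Assumption: same comparison rule as in permutationPValue.
            if (ksStatistic(a, b) >= observed - 1e-12) count++;
        }
        return (double) count / iterations;
    }
}
```

Calling both estimators on the same pair of samples with the same iteration count gives a direct way to see how far apart the two definitions land for a given dataset; the first only ever produces splits that reuse each observed value exactly as often as it occurs, while the second does not.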

> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
>                 Key: MATH-1246
>                 URL: https://issues.apache.org/jira/browse/MATH-1246
>             Project: Commons Math
>          Issue Type: Bug
>            Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the distribution of a D-statistic for m-n sets with no ties.  No warning or special handling is delivered in the presence of ties.


