Posted to dev@commons.apache.org by Phil Steitz <ph...@gmail.com> on 2013/07/20 19:01:42 UTC

[math] Kolmogorov-Smirnov 2-sample test

I am working on MATH-437 (turning K-S distribution into a proper K-S
test impl) and have to decide how to implement 2-sample tests. 
Asymptotically, the 2-sample D_n,m test statistic (see [1]) has a
K-S distribution, so for large samples just using the cdf we already
have is appropriate.  For small samples (actually for any size
sample), the test statistic distribution is discrete and can be
computed exactly.  A brute force way to do that is to enumerate all
of the n-m partitions of {0, ..., n+m-1} and compute all the
possible D_n,m values.  R seems to use a more clever way to do
this.  Does anyone have a reference for an efficient way to compute
the exact distribution, or background on where R got their
implementation?

Absent a "clever" approach, I see three alternatives and would
appreciate some feedback:

0) just use the asymptotic distribution even for small samples
1) fully enumerate all n-m partitions and compute the D_n,m values as above
2) use a Monte Carlo approach - instead of full enumeration, randomly
generate a large number of splits and estimate the p-value for the
observed D_n,m as the proportion of random n-m splits that generate a
D value at least as large as the one observed.

Thanks in advance for any feedback / pointers.

Phil

[1] http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [math] Kolmogorov-Smirnov 2-sample test

Posted by Phil Steitz <ph...@gmail.com>.
On 8/10/13 11:00 AM, Phil Steitz wrote:
> On 8/10/13 10:41 AM, Ajo Fod wrote:
>> This paper provides some approximations and situations where you could use
>> exact computations:
>> http://www.iro.umontreal.ca/~lecuyer/myftp/papers/ksdist.pdf
>> ... they even refer to a java program at the end of page 11.
> He he.  This is the reference that our current K-S impl is based on
> :)  So our impl correctly handles small samples for the 1-sample test.
>
> This paper does provide some references for exact computation of
> D_n.  Unfortunately, what I need is D_n,m for the two-sample test
> and while asymptotically D_n,m is K-S distributed, I don't think I
> can adapt exact algorithms for D_n to compute exact D_n,m.  I could
> be wrong here - would love to be shown how to do it.

Sorry, everywhere above I meant "the probability distribution of
D_n/D_n,m" in place of "D_n/D_n,m"

Phil


Re: [math] Kolmogorov-Smirnov 2-sample test

Posted by Phil Steitz <ph...@gmail.com>.
On 8/10/13 10:41 AM, Ajo Fod wrote:
> This paper provides some approximations and situations where you could use
> exact computations:
> http://www.iro.umontreal.ca/~lecuyer/myftp/papers/ksdist.pdf
> ... they even refer to a java program at the end of page 11.

He he.  This is the reference that our current K-S impl is based on
:)  So our impl correctly handles small samples for the 1-sample test.

This paper does provide some references for exact computation of
D_n.  Unfortunately, what I need is D_n,m for the two-sample test
and while asymptotically D_n,m is K-S distributed, I don't think I
can adapt exact algorithms for D_n to compute exact D_n,m.  I could
be wrong here - would love to be shown how to do it.

Phil


Re: [math] Kolmogorov-Smirnov 2-sample test

Posted by Ajo Fod <aj...@gmail.com>.
This paper provides some approximations and situations where you could use
exact computations:
http://www.iro.umontreal.ca/~lecuyer/myftp/papers/ksdist.pdf
... they even refer to a java program at the end of page 11.

Cheers,
Ajo.



Re: [math] Kolmogorov-Smirnov 2-sample test

Posted by Phil Steitz <ph...@gmail.com>.
On 8/10/13 9:35 AM, Ajo Fod wrote:
> In:
> http://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/lecture-notes/lecture14.pdf
>
> Take a look at 2 sample KS stats and the relationship to the 1 sample ...
> page 88. You already have the 1 sample table.

The problem is that I don't have the *exact* 1 sample table, or more
precisely the means to generate it.  What our current machinery
allows us to compute is an asymptotically valid approximation to the
distribution of D_n,m.  Theorem 2 on page 87 of the reference above
justifies the large sample approach; but it is an asymptotic
result.   Using it is fine for large n, m, but not so good for small
samples.  For that we need the exact, discrete distribution of
D_n,m.  Like other math stat references I have consulted, the one
above states that the discrete distribution has been tabulated for
small n,m and those tables are available in most math stat texts. 
What I need is the algorithm used to generate those tables.
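For small n, m a brute-force generator along these lines would reproduce those tables - a sketch only, and not the clever algorithm R apparently uses (that remains the open question):

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def exact_distribution(n, m):
    """Exact null distribution of D_{n,m} for small n, m, built by
    brute force over all C(n+m, n) equally likely splits of the
    pooled ranks.  Returns {d_value: probability} with exact
    rational probabilities.  Cost grows combinatorially, so this is
    only feasible for small samples.
    """
    counts = Counter()
    for xs in combinations(range(n + m), n):
        xs = set(xs)
        i = j = 0
        d = Fraction(0)
        # Walk the pooled ranks, tracking both empirical CDFs.
        for r in range(n + m):
            if r in xs:
                i += 1
            else:
                j += 1
            d = max(d, abs(Fraction(i, n) - Fraction(j, m)))
        counts[d] += 1
    total = sum(counts.values())
    return {d: Fraction(c, total) for d, c in sorted(counts.items())}
```

For n = m = 2, this gives P(D = 1/2) = 2/3 and P(D = 1) = 1/3, matching the tabulated values.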

Phil


Re: [math] Kolmogorov-Smirnov 2-sample test

Posted by Ajo Fod <aj...@gmail.com>.
In:
http://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/lecture-notes/lecture14.pdf

Take a look at 2 sample KS stats and the relationship to the 1 sample ...
page 88. You already have the 1 sample table.

Cheers,
Ajo



Re: [math] Kolmogorov-Smirnov 2-sample test

Posted by Phil Steitz <ph...@gmail.com>.
On 8/10/13 8:59 AM, Ajo Fod wrote:
> This depends on data size. If it fits in memory, a single pass through the
> sorted array to find the biggest differences would suffice.
>
> If the data doesn't fit, you probably need a StorelessQuantile estimator
> like QuantileBin1D from the colt libraries. Then pick a resolution and do
> the single pass search.

Thanks, Ajo.  I have no problem computing the D_n,m statistics.  My
problem is in computing the exact p-values for the test.  For that,
I need to compute the exact distribution of D_n,m.  Brute-forcing
requires examining all n + m choose n possible splits.  R seems
to use a clever approach, but there is no documentation in the R
sources on how the algorithm works.  Moreover, my first attempts at
Monte Carlo simulation don't give the same results.  Most likely, I
have not set the simulation up correctly.  Any better ideas or
references on how to compute the exact distribution would be
appreciated.

Phil


Re: [math] Kolmogorov-Smirnov 2-sample test

Posted by Ajo Fod <aj...@gmail.com>.
Ted: Thanks for the pointer. I've been looking for a replacement for colt
for a while because it doesn't allow the merge operation for QuantileBin1D
and, as you pointed out, it is a generally messy library. stream-lib could
be the replacement I'm looking for ... if it passes other requirements.
Will keep you posted.

Phil: Looks like deriving the KS stats from first principles may be the
best way. IMHO, it is the distribution of extrema of binomial distributions
at each discrete quantile of m (best if m < n). Those extrema are probably
the most efficient to compute with MC simulations. It's an interesting
problem, so best of luck.

I'm curious about the application that led to needing the test for small
samples.

Cheers,
-Ajo



Re: [math] Kolmogorov-Smirnov 2-sample test

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Aug 10, 2013 at 8:59 AM, Ajo Fod <aj...@gmail.com> wrote:

> If the data doesn't fit, you probably need a StorelessQuantile estimator
> like QuantileBin1D from the colt libraries. Then pick a resolution and do
> the single pass search.
>

Peripheral to the actual topic, but the Colt libraries are out of date in
almost every respect.  When we added unit tests, even the most basic
functions turned up dozens of serious bugs.  With respect to more advanced
estimation such as quantiles, nothing in Colt comes close to stream-lib.
Even the Mahout on-line estimators are generally superior.

QuantileBin1D, in particular, lacks the machinery of QDigests (not
surprising, since they were published in 2004, long after Colt went
dormant).  Check out

https://github.com/clearspring/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/QDigest.java

and the original paper

http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf

Re: [math] Kolmogorov-Smirnov 2-sample test

Posted by Ajo Fod <aj...@gmail.com>.
This depends on data size. If it fits in memory, a single pass through the
sorted array to find the biggest differences would suffice.

If the data doesn't fit, you probably need a StorelessQuantile estimator
like QuantileBin1D from the colt libraries. Then pick a resolution and do
the single pass search.
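The in-memory single pass might look like this in outline (a sketch with names of my choosing; it assumes both samples fit in memory and handles ties by advancing past equal values in both samples before comparing the CDFs):

```python
def ks_2sample_statistic(x, y):
    """D_{n,m}: the largest absolute gap between the two empirical
    CDFs, found in one merge-style pass over the sorted samples."""
    x, y = sorted(x), sorted(y)
    n, m = len(x), len(y)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(x[i], y[j])
        # Consume all copies of v from both samples before comparing,
        # so tied values do not create a spurious gap.
        while i < n and x[i] == v:
            i += 1
        while j < m and y[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d
```

Two identical samples give D = 0, and two fully separated samples give D = 1.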

Cheers,
-Ajo

