Posted to dev@commons.apache.org by Matt Adereth <ad...@gmail.com> on 2013/11/08 13:35:01 UTC

[math] Inconsistent handling of insufficient data when computing correlations

While writing the test cases for KendallsCorrelation, I discovered an
interesting behavior with SpearmansCorrelation that might be considered an
inconsistency.  SpearmansCorrelation.correlate() throws
MathIllegalArgumentException if the array length is less than 2, but
returns Double.NaN if the array contains multiple copies of a single value.

This seems inconsistent with how insufficient data is handled elsewhere in
Apache Commons Math.

In the User Guide for SimpleRegression it says:

> When there are fewer than two observations in the model, or when there is
> no variation in the x values (i.e. all x values are the same) all
> statistics return NaN. At least two observations with different x
> coordinates are required to estimate a bivariate regression model.

Similarly, all the UnivariateStatistics return Double.NaN when there isn't
enough data.

When I'm computing various statistics on multiple datasets, it seems
unnecessarily cumbersome to specially handle an exception for one statistic and
NaNs for the others.  I propose that PearsonsCorrelation and
SpearmansCorrelation should return NaN if there is insufficient data,
whether it be from not enough observations (< 2) or not enough unique
values.
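
To make the proposal concrete, here is a minimal sketch of the kind of
caller-side guard that is needed today. It assumes Java 8 or later. The
names InsufficientDataGuard, orNaN and strictPearson are hypothetical and
not part of Commons Math; strictPearson is only a stand-in for a statistic
that throws on short input.

```java
import java.util.function.ToDoubleBiFunction;

/** Sketch only: these names are hypothetical and not part of Commons Math. */
final class InsufficientDataGuard {

    /** Returns NaN for fewer than two paired observations; otherwise delegates. */
    static double orNaN(ToDoubleBiFunction<double[], double[]> statistic,
                        double[] x, double[] y) {
        if (x.length == y.length && x.length < 2) {
            // Too few observations: report it the same way the NaN-returning statistics do.
            return Double.NaN;
        }
        // Mismatched lengths are a caller error, so they still reach the statistic.
        return statistic.applyAsDouble(x, y);
    }

    /** Stand-in for a correlation that rejects short input with an exception. */
    static double strictPearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        double sumX = 0, sumY = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
        }
        double meanX = sumX / n, meanY = sumY / n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] one = {1};
        // One observation: the guard returns NaN instead of letting the exception escape.
        System.out.println(orNaN(InsufficientDataGuard::strictPearson, one, one));
    }
}
```

With Commons Math on the classpath, the statistic argument could presumably
be a method reference such as new SpearmansCorrelation()::correlate. The
guard checks the length up front rather than catching the exception, so
other argument errors are not silently turned into NaN.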

Re: [math] Inconsistent handling of insufficient data when computing correlations

Posted by Gilles <gi...@harfang.homelinux.org>.
On Fri, 8 Nov 2013 07:35:01 -0500, Matt Adereth wrote:
> While writing the test cases for KendallsCorrelation, I discovered an
> interesting behavior with SpearmansCorrelation that might be considered an
> inconsistency.  SpearmansCorrelation.correlate() throws
> MathIllegalArgumentException if the array length is less than 2, but
> returns Double.NaN if the array contains multiple copies of a single value.
>
> This seems inconsistent with how insufficient data is handled elsewhere in
> Apache Commons Math.
>
> In the User Guide for SimpleRegression it says:
>
>> When there are fewer than two observations in the model, or when there is
>> no variation in the x values (i.e. all x values are the same) all
>> statistics return NaN. At least two observations with different x
>> coordinates are required to estimate a bivariate regression model.
>
> Similarly, all the UnivariateStatistics return Double.NaN when there isn't
> enough data.
>
> When I'm computing various statistics on multiple datasets, it seems
> unnecessarily cumbersome to specially handle an exception for one statistic and
> NaNs for the others.  I propose that PearsonsCorrelation and
> SpearmansCorrelation should return NaN if there is insufficient data,
> whether it be from not enough observations (< 2) or not enough unique
> values.

At first sight, I'd rather expect that an identified problem (such as
"insufficient data") would raise an appropriate exception, whereas NaN
could result from other problems.


Regards,
Gilles


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [math] Inconsistent handling of insufficient data when computing correlations

Posted by Phil Steitz <ph...@gmail.com>.
On 11/8/13 1:27 PM, Phil Steitz wrote:
> On 11/8/13 4:35 AM, Matt Adereth wrote:
>> While writing the test cases for KendallsCorrelation, I discovered an
>> interesting behavior with SpearmansCorrelation that might be considered an
>> inconsistency.  SpearmansCorrelation.correlate() throws
>> MathIllegalArgumentException if the array length is less than 2, but
>> returns Double.NaN if the array contains multiple copies of a single value.
> The latter sounds like a bug, assuming you are using the default
> NaturalRanking rank transform.  Ties should be averaged and handled
> correctly in this case.  Please open a JIRA, ideally with test case
> for this.

Does not actually look like a bug, at least I have not been able to
reproduce it.  You do get NaN when there are not at least two
distinct values in the x array (the first array to be correlated). 
That does need to be documented (as it is in SimpleRegression).
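
One way to see why NaN is the natural result here is sketched below. This
is not the Commons Math code; ConstantInputDemo and its methods are
hypothetical names, and the ranking is a simple quadratic-time version of
average-rank tie handling. When every value is tied, every rank is the
same, so the rank variance is zero and the Pearson formula on the ranks
evaluates 0.0 / 0.0.

```java
/** Sketch only: class and method names are hypothetical, not Commons Math code. */
final class ConstantInputDemo {

    /** Average ranks: tied values share the mean of the ranks they occupy. */
    static double[] averageRanks(double[] v) {
        double[] ranks = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            int less = 0;
            int equal = 0;
            for (double other : v) {
                if (other < v[i]) {
                    less++;
                } else if (other == v[i]) {
                    equal++;
                }
            }
            // Tied values occupy ranks less+1 .. less+equal; this is their mean.
            ranks[i] = less + (equal + 1) / 2.0;
        }
        return ranks;
    }

    /** Plain Pearson correlation with no precondition checks. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
        }
        double meanX = sumX / n, meanY = sumY / n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // With no variation in x, both sxy and sxx are 0, so this is 0.0 / 0.0.
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Spearman correlation as Pearson correlation of the average ranks. */
    static double spearman(double[] x, double[] y) {
        return pearson(averageRanks(x), averageRanks(y));
    }

    public static void main(String[] args) {
        double[] constant = {5, 5, 5};
        double[] varying = {1, 2, 3};
        System.out.println(java.util.Arrays.toString(averageRanks(constant))); // [2.0, 2.0, 2.0]
        System.out.println(spearman(constant, varying)); // NaN
    }
}
```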

Phil
>
>> This seems inconsistent with how insufficient data is handled elsewhere in
>> Apache Commons Math.
> Good point.  I think there is justification for the different
> behavior here though.  SimpleRegression and the univariate stats are
> mutable, maintaining a dataset that can be added to, with stats
> queried at any point.  So while in theory, getSlope() in
> SimpleRegression could throw IllegalStateException (IAE not really
> appropriate here) when there is not enough data in the model, its
> documented behavior in this case is to return NaN.  The key is to
> clearly document the behavior.  SimpleRegression does this well, the
> correlation classes not so much.  Patches welcome to improve the
> documentation of preconditions and behavior of these classes.  I
> would be OK with changing the correlation classes to return NaNs in
> place of throwing IAE on insufficient data; but this change should
> happen in a major release (i.e. wait for 4.0).
>
> Phil
>
>  
>> In the User Guide for SimpleRegression it says:
>>
>>> When there are fewer than two observations in the model, or when there is
>>> no variation in the x values (i.e. all x values are the same) all
>>> statistics return NaN. At least two observations with different x
>>> coordinates are required to estimate a bivariate regression model.
>>
>> Similarly, all the UnivariateStatistics return Double.NaN when there isn't
>> enough data.
>>
>> When I'm computing various statistics on multiple datasets, it seems
>> unnecessarily cumbersome to specially handle an exception for one statistic and
>> NaNs for the others.  I propose that PearsonsCorrelation and
>> SpearmansCorrelation should return NaN if there is insufficient data,
>> whether it be from not enough observations (< 2) or not enough unique
>> values.
>>




Re: [math] Inconsistent handling of insufficient data when computing correlations

Posted by Phil Steitz <ph...@gmail.com>.
On 11/8/13 4:35 AM, Matt Adereth wrote:
> While writing the test cases for KendallsCorrelation, I discovered an
> interesting behavior with SpearmansCorrelation that might be considered an
> inconsistency.  SpearmansCorrelation.correlate() throws
> MathIllegalArgumentException if the array length is less than 2, but
> returns Double.NaN if the array contains multiple copies of a single value.

The latter sounds like a bug, assuming you are using the default
NaturalRanking rank transform.  Ties should be averaged and handled
correctly in this case.  Please open a JIRA, ideally with test case
for this.

>
> This seems inconsistent with how insufficient data is handled elsewhere in
> Apache Commons Math.

Good point.  I think there is justification for the different
behavior here though.  SimpleRegression and the univariate stats are
mutable, maintaining a dataset that can be added to, with stats
queried at any point.  So while in theory, getSlope() in
SimpleRegression could throw IllegalStateException (IAE not really
appropriate here) when there is not enough data in the model, its
documented behavior in this case is to return NaN.  The key is to
clearly document the behavior.  SimpleRegression does this well, the
correlation classes not so much.  Patches welcome to improve the
documentation of preconditions and behavior of these classes.  I
would be OK with changing the correlation classes to return NaNs in
place of throwing IAE on insufficient data; but this change should
happen in a major release (i.e. wait for 4.0).
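
To illustrate the distinction, here is a minimal sketch of a mutable,
incrementally updated statistic that can be queried at any point and
returns NaN until it has enough data. RunningSlope is a hypothetical
class, not SimpleRegression, and it uses naive running sums that a real
implementation would likely replace with a numerically stabler update.

```java
/** Sketch only: RunningSlope is a hypothetical class, not SimpleRegression. */
final class RunningSlope {
    private long n;
    private double sumX;
    private double sumY;
    private double sumXX;
    private double sumXY;

    /** Adds one observation; the model can be queried before or after any call. */
    void addData(double x, double y) {
        n++;
        sumX += x;
        sumY += y;
        sumXX += x * x;
        sumXY += x * y;
    }

    /** Least-squares slope, or NaN when the data cannot determine one. */
    double getSlope() {
        if (n < 2) {
            return Double.NaN; // fewer than two observations
        }
        double denominator = n * sumXX - sumX * sumX;
        if (denominator == 0.0) {
            return Double.NaN; // no variation in x
        }
        return (n * sumXY - sumX * sumY) / denominator;
    }

    public static void main(String[] args) {
        RunningSlope model = new RunningSlope();
        System.out.println(model.getSlope()); // NaN: empty model
        model.addData(1, 2);
        System.out.println(model.getSlope()); // NaN: one observation
        model.addData(2, 4);
        System.out.println(model.getSlope()); // 2.0
    }
}
```

Since an empty or half-filled model is an ordinary intermediate state for
such a class, returning NaN avoids forcing callers to guard every query.
A one-shot method that receives complete arrays has no such intermediate
state.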

Phil

 
>
> In the User Guide for SimpleRegression it says:
>
>> When there are fewer than two observations in the model, or when there is
>> no variation in the x values (i.e. all x values are the same) all
>> statistics return NaN. At least two observations with different x
>> coordinates are required to estimate a bivariate regression model.
>
> Similarly, all the UnivariateStatistics return Double.NaN when there isn't
> enough data.
>
> When I'm computing various statistics on multiple datasets, it seems
> unnecessarily cumbersome to specially handle an exception for one statistic and
> NaNs for the others.  I propose that PearsonsCorrelation and
> SpearmansCorrelation should return NaN if there is insufficient data,
> whether it be from not enough observations (< 2) or not enough unique
> values.
>

