Posted to issues@commons.apache.org by "Anders Conbere (JIRA)" <ji...@apache.org> on 2014/08/11 20:28:13 UTC

[jira] [Commented] (MATH-1140) Incorrect result from MannWhitneyUTest#mannWhitneyUTest with large datasets

    [ https://issues.apache.org/jira/browse/MATH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093104#comment-14093104 ] 

Anders Conbere commented on MATH-1140:
--------------------------------------

I found the actual source of the issue I'm experiencing: an integer overflow when calculating U1 in mannWhitneyU, where the array lengths are multiplied together. Since array lengths are ints, the product wraps around once it exceeds Integer.MAX_VALUE, which imposes a pretty small maximum on the size of the inputs (roughly Math.sqrt(Integer.MAX_VALUE), or about 46,340 elements, when both arrays have the same length). I would recommend casting the lengths to long or double to improve usability, or asserting the maximum length of the arrays early on.
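
To illustrate the overflow described above, here is a minimal standalone sketch. It does not reproduce the Commons Math source; the class name LengthProductSketch and its helper methods are hypothetical and only demonstrate how multiplying two int lengths wraps around, and how widening to long (or checking the lengths up front) avoids it.

```java
// Hypothetical sketch, not the Commons Math implementation.
public class LengthProductSketch {

    // int * int is evaluated in 32-bit arithmetic, so the result
    // silently wraps around when it exceeds Integer.MAX_VALUE.
    static int productAsInt(int n1, int n2) {
        return n1 * n2;
    }

    // Casting one operand to long forces 64-bit multiplication.
    static long productAsLong(int n1, int n2) {
        return (long) n1 * n2;
    }

    // Alternative: reject inputs whose length product does not fit in an int.
    static void requireProductFitsInInt(int n1, int n2) {
        if ((long) n1 * n2 > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                "product of array lengths exceeds Integer.MAX_VALUE: " + n1 + " * " + n2);
        }
    }

    public static void main(String[] args) {
        int n = 50000; // the array size from the original report

        System.out.println(productAsInt(n, n));  // wrapped, negative value
        System.out.println(productAsLong(n, n)); // 2500000000

        requireProductFitsInInt(2000, 2000);     // passes
        try {
            requireProductFitsInInt(n, n);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With 50k elements per array the int product is negative, which would be consistent with the nonsensical p-value; with 2k elements the product (4,000,000) fits comfortably in an int.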

> Incorrect result from MannWhitneyUTest#mannWhitneyUTest with large datasets
> ---------------------------------------------------------------------------
>
>                 Key: MATH-1140
>                 URL: https://issues.apache.org/jira/browse/MATH-1140
>             Project: Commons Math
>          Issue Type: Bug
>    Affects Versions: 3.3
>            Reporter: Anders Conbere
>            Priority: Minor
>
> On large datasets, MannWhitneyUTest#mannWhitneyUTest returns the double value 0.0 instead of the correct p-value. I suspect this is an overflow but haven't been able to track it down yet.
> I'm afraid I'm not very good at Java, but I'm including a link to a public repository where you can reproduce the issue. Unfortunately, my implementation is written in Clojure.
> https://github.com/aconbere/apache-commons-mann-whitney-bug
> The summary is that by calling MannWhitneyUTest#mannWhitneyUTest with two randomly generated arrays (50k elements each, with a max value of 300) I can reliably reproduce the result 0.0. By reducing that to something more modest, like 2k, I get correct p-value calculations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)