Posted to issues@commons.apache.org by "Alex Herbert (Jira)" <ji...@apache.org> on 2021/08/21 14:02:00 UTC
[jira] [Commented] (MATH-1627) ChiSquareTest computes NaN with zero observations

    [ https://issues.apache.org/jira/browse/MATH-1627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17402614#comment-17402614 ] 

Alex Herbert commented on MATH-1627:
------------------------------------

Even with at least 1 observation (so the total is non-zero), this issue still occurs if any row or column of the input contingency table sums to zero. The expected value for every cell in that row or column is then zero, and the statistic divides by the expected value (0 / 0 = NaN):
{code:java}
double sumSq = 0.0d;
for (int row = 0; row < nRows; row++) {
    for (int col = 0; col < nCols; col++) {
        // *** Will create NaN if expected is zero ***
        double expected = (rowSum[row] * colSum[col]) / total;
        sumSq += ((counts[row][col] - expected) *
                (counts[row][col] - expected)) / expected;
    }
}
{code}
In this case two options are available:
 # Raise an exception if any column or row total is zero
 # Exclude those rows and columns from the sum of squared deviations

Option 2 may be more robust. However, the rows and columns that are ignored still contribute to the degrees of freedom of the test, which is computed from the table dimensions:
{code:java}
long[][] counts = ...;
double df = ((double) counts.length - 1) * ((double) counts[0].length - 1);
{code}
Note that R computes NaN for the chi-square statistic in this case, but returns a value if every row and column contains at least 1 non-zero count:
{code:r}
> m <- array(c(1,0,1,0), dim = c(2,2))
> chisq.test(m)

	Pearson's Chi-squared test

data:  m
X-squared = NaN, df = 1, p-value = NA

Warning message:
In chisq.test(m) : Chi-squared approximation may be incorrect

> m <- array(c(1,0,0,1), dim = c(2,2))
> chisq.test(m)

	Pearson's Chi-squared test with Yates' continuity correction

data:  m
X-squared = 0, df = 1, p-value = 1

Warning message:
In chisq.test(m) : Chi-squared approximation may be incorrect
{code}
Thus R detects when all observations are zero (and raises an error), but does not raise an error when a row or column is entirely zero; it computes NaN instead.

If the rows/columns are ignored (i.e. a cell with expected = 0 contributes 0 to the sum of squared deviations) then the chi-square for the NaN case above is 0.0. The same holds for a 2x2 table with a single non-zero entry, irrespective of the count in that entry:
{noformat}
[400, 0]
[0, 0]

chi2 = 0.0
{noformat}
This effectively ignores rows/columns, which reduces the degrees of freedom (DoF) of the chi-square statistic. In the 2x2 case, removing any row or column reduces the DoF to zero, which is invalid.
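
As an illustration of the "ignore" behaviour described above, here is a minimal standalone sketch. It is not the Commons Math implementation: the class and method names are hypothetical, and it assumes a rectangular table of non-negative counts. Any cell whose expected value is zero is skipped:

```java
class ChiSquareSkipZeroSketch {

    /**
     * Pearson chi-square statistic for a rectangular table of non-negative counts.
     * Cells with an expected value of zero are skipped (hypothetical helper).
     */
    static double chiSquareIgnoringEmpty(long[][] counts) {
        final int nRows = counts.length;
        final int nCols = counts[0].length;
        final double[] rowSum = new double[nRows];
        final double[] colSum = new double[nCols];
        double total = 0;
        for (int row = 0; row < nRows; row++) {
            for (int col = 0; col < nCols; col++) {
                rowSum[row] += counts[row][col];
                colSum[col] += counts[row][col];
                total += counts[row][col];
            }
        }
        double sumSq = 0;
        for (int row = 0; row < nRows; row++) {
            for (int col = 0; col < nCols; col++) {
                final double expected = (rowSum[row] * colSum[col]) / total;
                // Skips expected == 0 and also NaN (0 / 0 when the total is zero)
                if (expected > 0) {
                    final double dev = counts[row][col] - expected;
                    sumSq += dev * dev / expected;
                }
            }
        }
        return sumSq;
    }

    public static void main(String[] args) {
        // Single non-zero entry: prints 0.0
        System.out.println(chiSquareIgnoringEmpty(new long[][] {{400, 0}, {0, 0}}));
        // Empty middle row and column: same statistic as the table {{1, 2}, {3, 4}}
        System.out.println(chiSquareIgnoringEmpty(new long[][] {{1, 0, 2}, {0, 0, 0}, {3, 0, 4}}));
    }
}
```

Note that an all-zero table also returns 0.0 in this sketch, which hides the problem rather than reporting it.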

This could be formalised in the class to ignore any column or row with all zero entries. With the current code:
{code:java}
ChiSquareTest chi2Test = new ChiSquareTest();

long[][] counts = {{1, 2}, {3, 4}};
System.out.println(chi2Test.chiSquare(counts));

long[][] counts2 = {{1, 0, 2}, {0, 0, 0}, {3, 0, 4}};
System.out.println(chi2Test.chiSquare(counts2));
{code}
Outputs:
{noformat}
0.07936507936507939
NaN
{noformat}
If the rows and columns are ignored when the expected value is zero then the chi-square is computed:
{noformat}
0.07936507936507939
0.07936507936507939
{noformat}
This effectively removes from the input contingency table any rows or columns with no data. The degrees of freedom for the chi-square distribution would have to be updated. This adds complexity to the class. It is preferable to leave it to the caller to eliminate categories from the contingency table data that have no observations.
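
Eliminating the empty categories on the caller side could look like the following sketch. The helper names (dropEmptyRowsAndColumns, degreesOfFreedom) are hypothetical and not part of ChiSquareTest. The counts are assumed to be non-negative and the table rectangular with at least one observation:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class ContingencyTableFilterSketch {

    /** Returns a copy of the table without rows or columns that sum to zero (hypothetical helper). */
    static long[][] dropEmptyRowsAndColumns(long[][] counts) {
        final int nRows = counts.length;
        final int nCols = counts[0].length;
        final long[] rowSum = new long[nRows];
        final long[] colSum = new long[nCols];
        for (int row = 0; row < nRows; row++) {
            for (int col = 0; col < nCols; col++) {
                rowSum[row] += counts[row][col];
                colSum[col] += counts[row][col];
            }
        }
        // Indices of the rows and columns that hold at least one observation
        final int[] rows = IntStream.range(0, nRows).filter(r -> rowSum[r] != 0).toArray();
        final int[] cols = IntStream.range(0, nCols).filter(c -> colSum[c] != 0).toArray();
        final long[][] result = new long[rows.length][cols.length];
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < cols.length; j++) {
                result[i][j] = counts[rows[i]][cols[j]];
            }
        }
        return result;
    }

    /** Degrees of freedom computed from the table dimensions: (rows - 1) * (columns - 1). */
    static int degreesOfFreedom(long[][] counts) {
        return (counts.length - 1) * (counts[0].length - 1);
    }

    public static void main(String[] args) {
        final long[][] counts2 = {{1, 0, 2}, {0, 0, 0}, {3, 0, 4}};
        final long[][] filtered = dropEmptyRowsAndColumns(counts2);
        System.out.println(Arrays.deepToString(filtered)); // [[1, 2], [3, 4]]
        System.out.println(degreesOfFreedom(counts2));     // 4
        System.out.println(degreesOfFreedom(filtered));    // 1
    }
}
```

The filtered table can then be passed to chiSquare, and the degrees of freedom follow from the reduced dimensions. The caller still has to check that at least two rows and two columns remain.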

A suggested fix is:
 # Raise a ZeroException if all input counts are zero
 # Raise a ZeroException if any column or row total is zero

The documentation should be updated to make the user aware this is possible. This type of input is invalid for the chi-square computation.
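
A minimal sketch of such a check, assuming non-negative counts in a rectangular table. IllegalArgumentException is used here as a stand-in for ZeroException so that the sketch compiles without Commons Math, and the class and method names are hypothetical. A table with all counts zero necessarily has zero row totals, so the row/column check also covers the all-zero case:

```java
class ContingencyTableCheckSketch {

    /** Throws if any row or column of the table has a zero total (hypothetical helper). */
    static void checkNoEmptyRowOrColumn(long[][] counts) {
        final long[] colSum = new long[counts[0].length];
        for (int row = 0; row < counts.length; row++) {
            long rowSum = 0;
            for (int col = 0; col < colSum.length; col++) {
                rowSum += counts[row][col];
                colSum[col] += counts[row][col];
            }
            if (rowSum == 0) {
                // Stand-in for ZeroException
                throw new IllegalArgumentException("zero total in row " + row);
            }
        }
        for (int col = 0; col < colSum.length; col++) {
            if (colSum[col] == 0) {
                // Stand-in for ZeroException
                throw new IllegalArgumentException("zero total in column " + col);
            }
        }
    }

    /** Demonstration wrapper: true if the check rejects the table. */
    static boolean isRejected(long[][] counts) {
        try {
            checkNoEmptyRowOrColumn(counts);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isRejected(new long[2][2]));               // true: all counts zero
        System.out.println(isRejected(new long[][] {{1, 0}, {1, 0}})); // true: second column is empty
        System.out.println(isRejected(new long[][] {{1, 2}, {3, 4}})); // false
    }
}
```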

 

> ChiSquareTest computes NaN with zero observations
> -------------------------------------------------
>
>                 Key: MATH-1627
>                 URL: https://issues.apache.org/jira/browse/MATH-1627
>             Project: Commons Math
>          Issue Type: Bug
>    Affects Versions: 4.0
>            Reporter: Alex Herbert
>            Priority: Trivial
>
> Zero observations input to the ChiSquareTest will compute NaN:
> {code:java}
> ChiSquareTest chi2Test = new ChiSquareTest();
> final long[][] counts = new long[2][2];
> // NaN
> double chi2 = chi2Test.chiSquare(counts);
> {code}
> This is due to a divide by zero error. This bug was identified by sonarcloud analysis.
> The unit tests use R as a reference. In R this case will raise an error that at least one entry must be positive. Setting a value to 1 allows R to compute a Chi-square test value but the value is not valid:
> {code:r}
> > m <- array(c(1,0,0,0), dim = c(2,2))
> > chisq.test(m)
> 	Pearson's Chi-squared test
> data:  m
> X-squared = NaN, df = 1, p-value = NA
> Warning message:
> In chisq.test(m) : Chi-squared approximation may be incorrect
> {code}
> Other methods in the ChiSquareTest will raise a ZeroException if the observations are zero for an entire array of observations or if a pair of observations in a bin are both zero.
> The chi-square test has assumptions that do not hold when the number of observations is small. The limit for the number of observations per category is variable. The document referenced in the code javadoc recommends an expected level of 5 per bin. To avoid setting limits on the sample size, a suggested fix is to raise a ZeroException if the sum of all counts is zero. This will avoid a NaN computation. Use of a suitable number of observations is left to the caller.


