Posted to issues@commons.apache.org by "Alex Herbert (Jira)" <ji...@apache.org> on 2023/04/05 12:27:00 UTC
[jira] [Resolved] (STATISTICS-69) Add an unconditioned exact test for 2x2 contingency tables

     [ https://issues.apache.org/jira/browse/STATISTICS-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Herbert resolved STATISTICS-69.
------------------------------------
    Resolution: Implemented

Added in commit:

60b26ebca6a46f80651074daf895910051092c65

> Add an unconditioned exact test for 2x2 contingency tables
> ----------------------------------------------------------
>
>                 Key: STATISTICS-69
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-69
>             Project: Commons Statistics
>          Issue Type: New Feature
>          Components: inference
>            Reporter: Alex Herbert
>            Priority: Minor
>             Fix For: 1.1
>
>
> A 2x2 contingency table [[a, b], [c, d]] is used to visualize N independent observations of two binary variables (G or g and H or h):
>  
> {noformat}
>     G g
>   -------
> H | a b | m
> h | c d | n
>   ----------
>     s r | N{noformat}
> The probability distributions are classified into 3 cases:
>  # The row and column sums are fixed in advance. All table entries are determined by a. This follows a hypergeometric distribution with parameters N, m, s.
>  # The row sums are fixed, but the column sums are not. All table entries are determined by a and c. The distribution is a joint binomial distribution with probabilities p0 and p1:
> a ~ B(m, p0); c ~ B(n, p1)
>  # Only the total N is fixed (row and column sums are not). The table (a, b, c, d) follows a multinomial distribution.
> Case 1 is covered by using Fisher's exact test (see [STATISTICS 64]). It does not occur very often in practice as it requires both the column and row sums to be fixed in advance. This is an exact conditioned test (as it conditions on the row and column sums).
> Case 2 is more common: the row sums are fixed but the column sums are not. For example, a clinical trial with two groups of fixed size (e.g. medication or placebo), where the outcome of cure or no cure for each patient is not known in advance.
> Case 3 is rare. For example flipping two coins N times and totalling the heads/tails for each independently.
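The probabilities for cases 1 and 2 can be sketched as below. This is a minimal illustration in Python, not the Commons Statistics code; the function names are hypothetical and the symbols a, c, m, n, s, N, p0, p1 follow the table above.

```python
from math import comb


def hypergeom_pmf(a, N, m, s):
    """Case 1: P(a) with row sum m, column sum s and total N all fixed."""
    # Hypothetical helper: choose a of the m in row H and s - a of the N - m in row h.
    return comb(m, a) * comb(N - m, s - a) / comb(N, s)


def joint_binomial_pmf(a, c, m, n, p0, p1):
    """Case 2: P(a, c) with row sums m and n fixed; a ~ B(m, p0), c ~ B(n, p1)."""
    # Hypothetical helper: the two rows are independent binomials.
    pa = comb(m, a) * p0 ** a * (1 - p0) ** (m - a)
    pc = comb(n, c) * p1 ** c * (1 - p1) ** (n - c)
    return pa * pc
```

For case 1 the support of a is max(0, s - n) to min(m, s); for case 2 every (a, c) with 0 <= a <= m and 0 <= c <= n is a possible table.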
> I propose adding an unconditioned exact test. Case 2 is the more common and simpler to support. It involves generating a test statistic for each possible table given the fixed totals. The p-value is obtained from the subset of possible tables whose test statistic is as or more extreme than that of the observed table. Alternatively the subset is built by incrementally adding candidates, choosing at each step the next sized subset with the smallest p-value. This is the CSM (Convexity, Symmetry, Minimization) test of Barnard (1945). This is computationally expensive and benefits from precomputed tables which rank the order of tables for a given size (m, n). In either case the computation of the p-value involves maximising the p-value over a nuisance parameter in the range (0, 1).
> Possible test statistics are Fisher's p-value for the table (known as Boschloo's test (1970)), or a Z-pooled or Z-unpooled statistic. Implementing the CSM test is computationally intensive.
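A rough sketch of the simpler approach (a fixed test statistic, not the CSM ordering) for case 2, assuming a two-sided test with the Z-pooled statistic and a plain grid over the nuisance parameter. The function names, the grid size and the tie tolerance are assumptions made for illustration; this is not the R, SciPy or Commons Statistics implementation.

```python
from math import comb, sqrt


def z_pooled(a, c, m, n):
    """Z-pooled statistic for the difference in proportions a/m - c/n."""
    p = (a + c) / (m + n)
    if p == 0.0 or p == 1.0:
        return 0.0  # assumption: degenerate tables carry no evidence of a difference
    return (a / m - c / n) / sqrt(p * (1 - p) * (1 / m + 1 / n))


def unconditional_p_value(a, c, m, n, points=99):
    """Two-sided p-value maximised over a grid of nuisance parameter values."""
    observed = abs(z_pooled(a, c, m, n))
    # Tables (i, j) as or more extreme than the observed one (small tolerance for ties).
    extreme = [(i, j) for i in range(m + 1) for j in range(n + 1)
               if abs(z_pooled(i, j, m, n)) >= observed - 1e-12]
    best = 0.0
    for k in range(1, points + 1):
        pi = k / (points + 1)  # common success probability under the null hypothesis
        total = sum(comb(m, i) * comb(n, j)
                    * pi ** (i + j) * (1 - pi) ** (m + n - i - j)
                    for i, j in extreme)
        best = max(best, total)
    return best
```

Swapping z_pooled for Fisher's p-value of each table (with "extreme" meaning a smaller or equal value) would give a Boschloo style test under the same enumeration.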
> There is a reference implementation in R as the Exact package:
> [https://cran.r-project.org/web/packages/Exact/Exact.pdf]
> SciPy has implementations of Boschloo's test and the Z-pooled/unpooled test (which they name Barnard's test):
> [https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boschloo_exact.html]
> [https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.barnard_exact.html]
> Note that the search for the nuisance parameter involves a univariate function with multiple local maxima. The implementations in R and SciPy both use multiple start points to find candidate locations for the maximum. This is done by evaluating N uniformly spaced points in (0, 1) and then (optionally) optimising from the best candidate to find the maximum. The function would require numerical differentiation, so it is suited to a derivative-free method such as Brent optimisation for the univariate case.
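The grid then refine search can be sketched as below. Golden-section search is used here as a simple derivative-free stand-in for Brent optimisation; the function name, grid size and tolerance are assumptions, and f is any univariate function on (0, 1), such as the p-value as a function of the nuisance parameter.

```python
from math import sqrt


def maximise_on_unit_interval(f, points=32, tol=1e-8):
    """Grid search on (0, 1) followed by golden-section refinement."""
    step = 1.0 / (points + 1)
    grid = [k * step for k in range(1, points + 1)]
    x0 = max(grid, key=f)  # best of the uniformly spaced start points
    # Bracket the best grid point with its neighbours.
    lo, hi = max(x0 - step, 0.0), min(x0 + step, 1.0)
    inv_phi = (sqrt(5.0) - 1.0) / 2.0
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > tol:
        if f1 < f2:  # maximum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = f(x2)
        else:  # maximum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = f(x1)
    x = 0.5 * (lo + hi)
    # Safeguard: never return a point worse than the best grid point.
    return (x, f(x)) if f(x) >= f(x0) else (x0, f(x0))
```

The refinement only searches between the neighbours of the best grid point, so a narrow peak that falls between grid points elsewhere can still be missed; a finer grid reduces that risk at the cost of more function evaluations.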
> See also:
> [https://en.wikipedia.org/wiki/Boschloo%27s_test]
> [https://en.wikipedia.org/wiki/Barnard%27s_test]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)