Full article: Type I error convergence of three hypothesis tests for small RxC contingency tables

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

ABSTRACT

Many statistical software packages provide hypothesis tests for the independence or association between the rows and columns of a $R \times C$ contingency table. However, insufficient guidance is available about the accuracy and domains-of-validity of some tests that are based on assumptions or approximations. This paper assesses the accuracy of the p-values for three tests of small $R \times C$ tables using Monte Carlo simulations and quantile regression. Results for the Fisher-Freeman-Halton exact tests indicate the p-value accuracy depends on the number of possible unique nominal probabilities. Results for the Pearson chi-square test indicate the p-value accuracy depends on the number of possible unique nominal probabilities and the expected cell counts. Results for the Goodman-Kruskal gamma test indicate the p-value accuracy depends on the number of possible unique ordinal probabilities, the expected cell counts, and the total cell counts. Empirical models for the accuracy of these tests and recommended domain-of-validity criteria are provided.

KEYWORDS:

Public Interest Statement

Many scientific, technical, and socio-economic fields use contingency tables to evaluate the observed relationships between two or more factors. Statistical tests are widely used to determine whether or not the factors are independent (i.e. no relationship), and whether or not any association between the factors is statistically significant. The results of these tests may affect public policy and legal decisions (e.g. energy, environmental, health, and safety). However, these tests may be based on assumptions and approximations that can affect the accuracy of the test results. Existing guidelines for the use of these tests are often imprecise, inconsistent, subjective, or non-existent. This paper attempts to objectively estimate the accuracy of three exemplar tests and proposes test domain-of-validity criteria based on the estimated accuracy of the tests, which can assist decision-making. The methods developed in this paper can also be applied to other statistical tests and critical test threshold levels.

1. Introduction

Many statistical software packages provide hypothesis tests for the independence (e.g. Pearson chi-square) or association (e.g. Goodman-Kruskal gamma) between the rows and columns of a $R \times C$ contingency table. These tests are often based on assumptions or approximations that may limit their domain-of-validity or accuracy. However, it is unclear from the existing literature and statistical software packages what these limitations are.

This paper assesses the domains-of-validity and accuracy of the Fisher-Freeman-Halton exact (Fisher, Citation1934; Freeman & Halton, Citation1951; Mehta & Patel, Citation1986), the Pearson chi-square (Fleiss, Citation1981), and the Goodman and Kruskal gamma (Brown & Benedetti, Citation1977; Goodman & Kruskal, Citation1979) test p-values for contingency tables up to $4 \times 4$ and $2 \times 5$ versus indices based on the table dimensions and the row and column margin totals. These three tests were evaluated because they are well known and implemented in several statistical software packages, and the results of which are commonly reported and interpreted by a wide audience. The Fisher–Freeman–Halton test was also included because it is based on the exact reference distribution instead of an approximation.

1.1. Background

1.1.1. R ✕ C contingency tables

Two-dimensional $R \times C$ contingency tables list the frequencies of two discrete data variable value combinations. Given row and column variables with $R$ and $C$ discrete values, let $n_{i j}$ represent the number of times the row variable is equal to the $i$ ’th discrete value and column variable is equal to the $j$ ’th discrete value. The result is a table with $R$ rows and $C$ columns. The $i$ ’th row and $j$ ’th column margin totals are denoted by $r_{i} \overset{Δ}{=} n_{i .}$ and $c_{j} \overset{Δ}{=} n_{. j}$ respectively, where a dot in the subscript denotes summation. The overall total is denoted by $n_{. .}$ . The expected cell count is ${\hat{n}}_{i j} = r_{i} c_{j} / n_{. .}$ .

There are several different tests for the statistical significance of a $R \times C$ table. The appropriateness and accuracy of these tests depends on the type of data, sampling method, sample size, and the null hypothesis ( $H_{0}$ ) to be tested. The null hypothesis for the Fisher-Freeman-Halton exact and Pearson chi-square tests is the row and column variables are independent or have equal underlying probabilities. The null hypothesis for the Goodman and Kruskal’s gamma test is no association between the row and column variables. Many of these tests are supported by statistical software packages, such as SAS (Citation2013), SPSS (Citation2009), R (R Core Team, Citation2017; Ekstrøm, Citation2020), and MATLAB (MathWorks, Citation2016; Cardillo, Citation2007a).

1.1.2. Problem statement

We have observed that the Goodman-Kruskal gamma test results for association for some contingency tables appear to be inconsistent with the Pearson chi-square or exact test results for independence (or equal underlying probabilities). For example, consider the $2 \times 3$ contingency table in . The results of three hypothesis tests for this $2 \times 3$ table are listed in . The p-values for the Fisher-Freeman-Halton exact and Pearson chi-square tests are much greater than 0.05, but the p-value for the Goodman-Kruskal gamma test is much less than 0.05.Footnote¹ Therefore, these results indicate that there is a statistically significant association between the treatment and outcome even though the null hypothesis for independence or equal underlying probabilities cannot be rejected.Footnote²

Table 1. Example $2 \times 3$ contingency table

Display Table

Table 2. Hypothesis test results for

Display Table

Existing test guidelines were not sufficient in understanding these apparent inconsistencies (e.g. ). Some recommended criteria for the Pearson chi-square test are that all ${\hat{n}}_{i j}$ should be greater than 5, 10, or even 20. Cochran [Cochran, Citation1952, p. 328], however, commented that the “numbers 10 and 5 appear to have been arbitrarily chosen”. The SPSS CROSSTAB Pearson chi-square test results for the Example in included a footnote “3 cells (50.0%) have expected count less than 5. The minimum expected count is 1.32”. The R chisq.text function presents a warning message for the same table that the “Chi-squared approximation may be incorrect”. However, Cochran indicates in [Cochran, Citation1954, p. 418] that there are a number of special cases where the $χ^{2}$ approximation may be adequate if some ${\hat{n}}_{i j} < 5$ . There is a large body of work that discusses guidelines for using the Pearson chi-square test. Tate and Hyer cited a wide range of opinions in [Tate and Hyer, Citation1973, p. 836] regarding these guidelines, which are summarized . Koehler and Larntz observed that “for the null hypothesis of symmetry, the chi-squared approximation for the Pearson statistic is quite adequate at the .05 and .01 nominal levels for expected frequencies as low as .25 when $k \geq 3$ , $n \geq 10$ , $n^{2} / k \geq 10$ ” [Koehler and Larntz, Citation1980, p. 343].Footnote³

Table 3. Example Pearson Chi-Square test recommendations summarized by Tate and Hyer (Tate and Hyer, Citation1973, p. 836)

Display Table

The guidelines for the Goodman-Kruskal gamma test are almost non-existent. Ott et al. indicates, “Although no specific requirement for sample size is generally given, we should never use this test when $n [. .] < 10$ ” [Ott et al., Citation1992, p. 399]. The accuracy of some of these tests for special cases were previously investigated using Monte Carlo methods in (Craddock & Flood, Citation1970; Göktaş & İşçi, Citation2011), however clear recommendations regarding the domain-of-validity of the tests were not given.

Therefore, the objective of this study was to assess the domain-of-validity and accuracy of the Fisher–Freeman–Halton, Pearson chi-square, and Goodman-Kruskal gamma tests in order to determine generally applicable guidelines for their use. It was not our objective to evaluate other tests or adjustments, such as the mid p-value (Agresti, Citation2002) or Ehwerhemuepha (Ehwerhemuepha et al., Citation2019), which may be more suitable for a given statistical problem.

1.2. Paper organization

The remainder of this paper is organized into five sections. The methods used to are described in Section 2. The results are described in Section 3. Limitations of this analysis are described in Section 4. Recommendations and conclusions are in Sections 5 and 6.

This paper is also accompanied by seven appendices and other supplementary material that are available online. A detailed description of the Monte Carlo simulation method and example results are in Appendix A. Supporting algorithms are described in Appendix B. Empirical envelope model search results are summarized in Appendix C. Supporting methods and results are in Appendix D. Example recommended accuracy criteria calculations are in Appendix E. MATLAB run scripts and functions to reproduce the results herein are described in Appendix F. R run scripts and functions to reproduce the example domain-of-validity and accuracy criteria results herein are described in Appendix G.

2. Methods

The evaluation of the hypothesis test domain-of-validity and accuracy was accomplished in two main steps. First, the 5th percentile p-values for each of the hypothesis tests were estimated for all combinations of $R \times C$ tables within a simulation domain using Monte Carlo simulations (e.g. [Kroese and Chan, Citation2014, Chapter 7]). This percentile p-value corresponds to the intended critical p-value for the test (e.g. 0.05).Footnote⁴ Envelope models to quantify the range of observed 5th percentile p-values based on the table dimensions and row and column margin totals were then determined.

2.1. Monte Carlo simulations

Monte Carlo simulations were used to estimate the 5th percentile p-values ( $p_{0.05})$ for contingency tables with independent row and column variables and fixed margin totals (i.e. Method II sampling [Fleiss, Citation1981, p.27]), for each of the hypothesis tests described in Subsection 2.1.1. This involved first generating a set of randomly sampled contingency tables with independent row and column variables and fixed $r_{i}$ and $c_{j}$ margin totals. Then, for each hypothesis test: the p-values for each randomly sampled table were calculated; the empirical distribution of the p-values was determined; and the 5th percentile p-value was calculated; as depicted by the examples in . This process is described in more detail in Appendix A. All of the results in this paper were calculated using MATLAB version 9.0 (R2016a) with the Statistics and Machine Learning Toolbox version 10.2 (R2016a) (MathWorks, Citation2016).

Figure 1. Distributions of observed hypothesis test p-values for a random sample of $2 \times 3$ tables with independent row and column variables, $r = [39, 11]$ and $c = [35, 9, 6]$

Figure 1. Distributions of observed hypothesis test p-values for a random sample of 2×3 tables with independent row and column variables, r=[39,11] and c=[35,9,6]

Example cumulative probability distributions of observed p-values from three hypothesis tests of randomly sampled tables are shown in . The vertical and horizontal scales of the P–P plots in this figure have been transformed by the cube root of the numerical values to place more emphasis on the smaller probability values that are of interest for hypothesis testing. The cumulative probabilities of the observed values were estimated by the mean rank [Electronics Reliability Subcommittee, Citation1987, pp.83–84].

Ideally the distribution of calculated p-values for each hypothesis test would be uniformly distributed between 0 and 1 if the null hypothesis is true. Therefore the $α$ ’th percentile p-value for each test would ideally be $α / 100 %$ , and all of the $\circ$ symbols in would ideally be on the diagonal dotted lines. For example, $p_{0.05}$ would ideally be 0.05. The deviation in the observed percentile p-values from the ideal values, as depicted by the $♢$ symbols in and the solid horizontal line, is the main subject of this analysis.

There are at least two sources of deviation depicted in . The first is a staircase effect associated with the discrete number of possible p-values for tables with the same row and column margin totals. This staircase effect is most apparent in Sub for the Fisher-Freeman-Halton exact test, where the sizes of the adjacent steps are about the same, but the overall size trend is increasing, and there are no other sources of error such as asymptotic approximations. The sum of the horizontal step lengths is 1. Therefore, the horizontal length of each step, which represents the minimum and maximum deviation in the observed p-values from the ideal values, tends to be inversely proportional to the number of possible unique p-values (e.g. $| N_{r c} |$ ). The steps for the Fisher-Freeman-Halton test are also to the right of the diagonal line, indicating the observed p-values tend to be equal or greater than the ideal p-value, and therefore the test tends to be conservative.

The second source of deviation from the ideal p-value, which is present in Sub , but not Sub , is due to the inaccuracy of the assumed reference distribution or asymptotic approximation. This effect is most apparent in Sub , where the calculated p-values between 0.004 and 0.07 tend to be less than the ideal values, based on the assumption that $z_{i} = G / A S E_{i}$ is normally distributed (see Appendix D.3), and therefore tends to be liberal. Furthermore the extreme tails of the calculated p-value distribution are also truncated in the other direction (i.e. tends to be conservative) because $- 1 \leq G \leq 1$ . This later condition occurred for one of the observed values in Sub where $G = - 1$ , $A S E_{0} = 0.2785$ , and the resulting p-value is 0.000330 (see the first row in Table A1). The example results in Sub suggest that the discrete staircase and asymptotic approximation effects on the Pearson chi-square test p-values have similar magnitudes, and that the overall trend tends to be unbiased.

P-value distributions were calculated for all combinations of the row and column margin totals (i.e. $r_{i}$ and $c_{j}$ ) within the simulation domain summarized in . This comprised all $R \times C$ tables with $n_{. .}$ between the minimum and maximum values listed in columns 3 and 4, inclusive. The $R \times C$ table dimensions comprised $2 \leq R \leq C \leq 4$ plus $2 \times 5$ tables. Tables with $R > C$ dimensions were not included because the p-values for the hypothesis tests considered herein are same as the transposed table. The $2 \times 5$ tables were included to better elucidate potential independent effects of table dimensions and degrees-of-freedom.

Table 4. $R \times C$ contingency table simulation domain

Display Table

The simulation method and example results are described in more detail in Appendix A.

2.1.1. Hypothesis test P-values

P-values for each of the following hypothesis tests were calculated for each unique table:

• Fisher and Fisher-Freeman-Halton exact tests. P-values for $2 \times 2$ tables were calculated using the fishertest function in the MATLAB statistics toolbox. P-values for $2 \times 3$ , $2 \times 4$ , and $3 \times 3$ tables were calculated using Cardillo’s myfisher23, myfisher24, and myfisher33 functions (Cardillo, Citation2007a, Citation2007b and Citation2007c).

• Pearson chi-square test. This test assumes that $X^{2} = \sum_{i = 1}^{R} \sum_{j = 1}^{C} {(n_{i j} - {\hat{n}}_{i j})}^{2} / {\hat{n}}_{i j}$ has a $χ^{2}$ distribution with $ν = (R - 1) (C - 1)$ degrees-of-freedom if $H_{0}$ is true. The p-value is the probability of the observed or larger $X^{2}$ value if $H_{0}$ is true, which was calculated using the chi2cdf function in the MATLAB statistics toolbox.

This assumption is one of convenience because the “ $X^{2}$ statistic has approximately a chi-square distribution for large samples sizes. It is difficult to specify what ‘large’ means, but $\{[{\hat{n}}_{i j}] \geq 5\}$ is sufficient” [Agresti, Citation1996, p. 28]. The accuracy of the p-value resulting from this assumption and approximation is one of the subjects of this study.

• Goodman-Kruskal’s gamma test. There is more than one version of the Goodman-Kruskal gamma test. The original version proposed by Goodman and Kruskal is described in Appendix D.3.2. A modified version based on (Brown & Benedetti, Citation1977), which is implemented in SAS [SAS, Citation2013, p. 160] and SPSS [SPSS, Citation2010, p. 161], is described in Appendix D.3.1. This test assumes that $z_{i} = G / A S E_{i}$ has a normal distribution with zero mean and unit standard deviation if $H_{0}$ , where $G$ and $A S E_{i}$ are calculated according to the equations in Appendix D.3. The subscript $i$ denotes the modifed ( $0$ ) or original ( $3$ ) version of the test. The p-value is the probability of the observed or larger $| z_{i} |$ value if $H_{0}$ is true, which was calculated using the normcdf function in the MATLAB statistics toolbox.

This assumption is also one of convenience because according to (Goodman & Kruskal, Citation1979) (e.g. pages 78 and 93) and (Brown & Benedetti, Citation1977), $z_{0}$ actually has an approximate, asymptotically unit-normal distribution for large $n_{. .}$ . The $z_{i}$ value is approximately normally distributed because $G$ is limited to $- 1 \leq G \leq 1$ . The accuracy of the p-value resulting from this assumption and approximation is also one of the subjects of this study.

2.2. Candidate accuracy assessment envelope models

It is assumed herein that the accuracy of the hypothesis test p-values can vary depending on the contingency table dimensions and the row and column margin totals. Candidate envelope models and indices to quantify this relationship are described is this section.

2.2.1. Candidate quantile models

The accuracy of the hypothesis tests is described herein by the 2.5% and 97.5% quantiles of the observed $p_{0.05}$ values. These quantiles represent the lower bounds ( $L B$ ) and upper bounds ( $U B$ ) of a 95% interval enclosed by the envelope model. Note the 0.0% and 100% quantiles were not used because these quantiles are numerically ill-defined and non-unique. For example, one numerical solution for the 100% quantile of observed $p_{0.05}$ values is simply 1, which is not very descriptive of the upper bounds. It was also generally assumed that the $L B$ and $U B$ will tend to (e.g. asymptotically) converge to the ideal value ( $p_{i d e a l}$ ) (e.g. 0.05) as the candidate indices tend towards a limiting value. This limiting value can be either infinity or zero, depending on the type of index. Therefore, the following empirical model forms were considered for the $L B$ , $U B$ , and median quantiles:

(1)

Q = π (g (p_{i d e a l}) + \sum_{i} \sum_{j} β_{i, j} x_{i}^{q_{i, j}}) + \isin

(1)

(2)

Q = π (g (p_{i d e a l}) + s exp (λ_{0} + \sum_{i} λ_{i} log (x_{i}))) + \isin

(2)

where $π (a) \overset{Δ}{=} exp (a) / (1 + exp (a))$ , $g (b) \overset{Δ}{=} ln (b / (1 - b))$ , and $x_{i}$ are one or more candidate indices. The $q_{i, j}$ in EquationEquation (1)(1) $Q = π (g (p_{i d e a l}) + \sum_{i} \sum_{j} β_{i, j} x_{i}^{q_{i, j}}) + \isin$ (1) are assumed polynomial powers for $x_{i}$ . The $β_{i, j}$ and $\in$ are unknown coefficients to be estimated by quantile regression (e.g. (Grinsted, Citation2008)). The $s$ coefficient in EquationEquation (2)(2) $Q = π (g (p_{i d e a l}) + s exp (λ_{0} + \sum_{i} λ_{i} log (x_{i}))) + \isin$ (2) is either +1 or −1, and $λ_{i}$ and $\in$ are unknown coefficients to be estimated by quantile regression. The form of these equations guarantee that $0 < Q < 1$ for $x_{i} > 0$ and $Q \to p_{i d e a l}$ if all $x_{i} \to \infty$ and $q_{i, j} < 0$ . These two models are equivalent if $β_{1, 1} = exp (λ_{0})$ , $q_{1, 1} = λ_{1}$ , and all other $β_{i, j} = 0$ and $λ_{i} = 0$ .

The polynomial model form in EquationEquation (1)(1) $Q = π (g (p_{i d e a l}) + \sum_{i} \sum_{j} β_{i, j} x_{i}^{q_{i, j}}) + \isin$ (1) is well suited for fitting one or more indices simultaneously for assumed $q_{i}$ values. The assumed values for $q_{i}$ included [Citation1], [−0.5], [−1], [−1,-2], and [−1,-2,-3] depending on the test and candidate index. Higher degree polynomials can fit the data better than lower degree polynomials, but can result in overparameterized models, which is undesirable.

The logarithmic model form in EquationEquation (2)(2) $Q = π (g (p_{i d e a l}) + s exp (λ_{0} + \sum_{i} λ_{i} log (x_{i}))) + \isin$ (2) is only applicable to a single index but can be used to estimate a scalar value for $q_{i}$ . These models tend to be more parsimonious, which is desirable.

2.2.2. Candidate accuracy assessment indices

Four types of candidate indices based on the table dimensions and margin totals were considered to assess different potential effects on p-value accuracy. The first three types of indices have positive values that tend to increase for tables with larger margin totals or overall counts. The last type of index has a value between 0 and 1.

2.2.2.1. Candidate discreteness indices

The following indices were considered to represent effects of the discrete p-value distribution resulting from the whole number table counts (e.g. as depicted by the staircase shaped curve in Sub ):

$|S_{r c}|$ The number of unique $R \times C$ tables with $r_{i}$ and $c_{j}$ marginal totals ( $| S |$ in this context denotes the cardinality of set $S$ , and $S_{r c}$ is the set of all unique $R \times$ C tables as defined in Appendix b.1) The rationale for this candidate index is that the distribution of sampled p-values depends on the maximum possible number of unique sampled p-values, which is less than or equal to $| S_{r c} |$ . An algorithm to calculate $| S_{r c} |$ given $R$ , $C$ , and the margin totals is in Appendix B.1. Other algorithms are described in (Diaconis & Gangolli, Citation1994; Gail & Mantel, Citation1977).

$| N_{r c} |$ The number of unique nominal probabilities for $R \times C$ tables with $r_{i}$ and $c_{j}$ marginal totals. The rationale for this index is that the distribution of sampled nominal (e.g. the Fisher-Freeman-Halton exact test) p-values depends on $| N_{r c} |$ . An algorithm to calculate $| N_{r c} |$ given $R$ , $C$ , and the margin totals is in Appendix B.2. Note that $| N_{r c} | \leq | S_{r c} |$ .

$| O_{r c} |$ The number of unique ordinal probabilities for $R \times C$ tables with $r_{i}$ and $c_{j}$ marginal totals. The rationale for this index is that the distribution of sampled ordinal (e.g. the Goodman-Kruskal gamma test) p-values depends on $| O_{r c} |$ . An algorithm to calculate $| O_{r c} |$ is in Appendix B.3. Note that $| O_{r c} | \leq | S_{r c} |$ .

2.2.2.2. Candidate approximation indices

The following indices were also considered for the Pearson chi-square and Goodman and Kruskal gamma tests to represent approximation accuracy and asymptotic standard error effects:

• The minimum expected cell count, which is ${\hat{n}}_{m i n} \overset{Δ}{=} {min}_{i = 1}^{R} ({min}_{j = 1}^{C} {\hat{n}}_{i j})$ . The rationale for this candidate index is that Fleiss [Fleiss, Citation1981, p. 25] and others have warned that the Pearson chi-square test may not be accurate if ${\hat{n}}_{m i n}$ is small, (e.g. ${\hat{n}}_{m i n} <$ 5, 10, or 20).

• The sum and the average of the $ν$ smallest expected cell counts, which are $\sum_{m i n}^{ν} \hat{n} \overset{Δ}{=} \sum_{k = m i n}^{ν} {\hat{n}}_{k}$ and $\sum_{m i n}^{ν} \hat{n} / ν$ , where ${\hat{n}}_{k}$ is the $k$ ’th smallest expected cell count. The rationale for these indices is that they can account for multiple small expected cell counts in higher dimension tables. These indices are equal to ${\hat{n}}_{m i n}$ for $2 \times 2$ tables.

• The harmonic mean and the root harmonic mean of the expected cell counts, which are ${\hat{n}}_{h m} \overset{Δ}{=} R C / (\sum_{i = 1}^{R} \sum_{j = 1}^{C} {\hat{n}}_{i j}^{- 1})$ , and ${\hat{n}}_{r h m s} \overset{Δ}{=} \sqrt{R C / (\sum_{i = 1}^{R} \sum_{j = 1}^{C} {\hat{n}}_{i j}^{- 2})}$ . The rationale for these indices is that they can also account for multiple small expected cell counts.

Scaled and offset versions of the candidate expected cell count indices ( $x_{i}$ ) were also considered (i.e. $A x_{i} + B$ ). The offset $B = 1$ was considered to avoid small numerical values which, when inverted by a negative $q_{i, j}$ power in EquationEquation (1)(1) $Q = π (g (p_{i d e a l}) + \sum_{i} \sum_{j} β_{i, j} x_{i}^{q_{i, j}}) + \isin$ (1) , become very large and can adversely affect the overall model fit. Scaling factors $A = ν$ and $A = R C$ were considered to address potential effects of different contingency table dimensions.

The scaling factor $A = 1 / f (X_{0.05}^{2}, ν)$ was also considered to potentially normalize the effects of different table dimensions on the Pearson chi-square p-values. This scaling factor and values are described in Appendix D.1.

2.2.2.3. Candidate sample size indices

The total table count $n_{. .}$ was also considered for the Pearson chi-square and Goodman and Kruskal gamma test envelope models. The rationale is that $n_{. .}$ appears in the gamma test based on $A S E_{0}$ but not the original test based on $A S E_{3}$ (see Equations D.3 and D.8). Therefore, the difference between the two gamma test p-values goes to zero if $n_{. .} \to 0$ . The scaled indices $ν n_{. .}$ , $n_{. .} / ν$ , $R C n_{. .}$ , and $n_{. .} / R C$ were also considered to address potential effects of different contingency table dimensions. Koehler and Larntz (Koehler & Larntz, Citation1980) also investigated the effects of sample size ( $n_{. .}$ ) and the number of cells ( $R C$ ) on the accuracy of the Pearson chi-square approximation.

2.2.2.4. Candidate P-value range indices

The minimum and maximum possible p-values for a given test and row and column margin totals may also be related to the domain-of-validity and the accuracy of the hypothesis test p-values. These indices are denoted by $min (p_{t e s t})$ and $max (p_{t e s t})$ . They can be calculated by the algorithms in Appendix B to calculate $| N_{r c} |$ and $| O_{r c} |$ .

In general $min (p_{t e s t})$ tends to approach 0 and $max (p_{t e s t})$ tends to approach 1, for tables with large numerical values. There are two special cases for the Fisher-Freeman-Halton exact test:

• $max (p_{F i s h e r}) = 1$ because the p-value for a given table is defined as the sum of the probabilities for all of the tables with the same $r_{i}$ and $c_{j}$ that are as extreme or more extreme as the given table [Fleiss, Citation1981, p. 25] and [MathWorks, Citation2016, p. 25–1564]. There will always be at least one extreme table for which this sum includes all of the possible tables.

• $min (p_{F i s h e r}) = 1$ if $| N_{r c} | = 1$ because the sum of the probabilities for the only unique probability is 1.0.Footnote⁵

This leads to the following proposed p-value domain-of-validity criterion:

If the critical p-value (e.g. 0.05) is less than $min (p_{t e s t})$ or greater than $max (p_{t e s t})$ then the row and column margin totals are outside the test p-value domain-of-validity for the critical p-value.

This criterion would be in addition to other possible test accuracy criteria to be identified.

The $max (p_{t e s t})$ indices were not considered for the candidate envelope models.

2.2.3. Quantile regression method

Quantile regressions were accomplished using the quantreg3.m function (see the Supplementary Online Material). This function finds the value for $b$ that minimizes fval, defined as

(3)

f v a l = \sum_{k}^{N} |{\tilde{y}}_{k} \{\begin{matrix} τ - 1 & i f {\tilde{y}}_{k} < 0 \\ τ & o t h e r w i s e \end{matrix}\}|,

(3)

where $\tilde{y} = F (X b) - y$ , $y$ is a $N \times 1$ matrix of $N$ observed dependent variable values (e.g. $p_{0.05}$ ), $X$ is a corresponding $N \times p$ matrix of $p$ corresponding independent variable values (e.g. $x_{i}^{- j}$ ), $b$ is an estimate of $β$ , and $τ$ is the desired quantile (e.g. 0.025 or 0.975), and $F$ is an output transformation (e.g. $F (∙) = π (∙ + g (p_{i d e a l}))$ ). The quantreg3 function is based on Grinsted’s quantreg.m function (Grinsted, Citation2008) with the following extensions:

(1) optional input to provide an initial guess for the estimated $β$ values ( $b_{0}$ ), otherwise it is assumed that $b_{0} = X^{- 1} y$ .

(2) optional input to provide an output transformation function $F$ , otherwise it is assumed that $F (∙) = ∙$ .

(3) optional output to return the fval value calculated by the fminsearch.m function.

The average difference between the observed and modeled quantile ( $\overset{ˉ}{f}$ ) was then calculated by dividing fval by the number of observations in the regression for candidate model comparison purposes.

The initial guesses for the $L B$ and $U B$ models were

(4)

b_{0} = {[\begin{matrix} X \\ I_{p \times p} \end{matrix}]}^{- 1} [\begin{matrix} y \\ 0_{p \times 1} \end{matrix}],

(4)

where $y_{k} = g (p_{0.05, k}) - g (p_{i d e a l})$ , $I_{p \times p}$ is an $p \times p$ identity matrix, and $0_{p \times 1}$ is a $p \times 1$ matrix of zeroes. The $p$ additional rows in EquationEquation (4)(4) $b_{0} = {[\begin{matrix} X \\ I_{p \times p} \end{matrix}]}^{- 1} [\begin{matrix} y \\ 0_{p \times 1} \end{matrix}],$ (4) helps to prevent large $b_{0}$ values if $X$ is ill-conditioned. The initial guess for the median model was the average of the estimated $β$ values for the $L B$ and $U B$ models.

In some cases, the quantile regression did not converge within the default number of iterations, possibly due to overparameterization. If this occurred then submodels with one or more dropped terms were evaluated, and the submodel yielding in the minimum fmin value was selected. In this case the estimated $β$ values for the missing terms were set to 0. As a final step, the quantile regression with all of the candidate terms was reattempted using the best estimated submodel as an initial guess. This was accomplished by the quantreg_s3.m function, which calls the quantreg3.m function (which was in turn based on Grinsted’s quantreg.m function (Grinsted, Citation2008)).

The 95% confidence intervals for the estimated model coefficients were estimated using the jackknife method described by Kahane [Kahane, Citation2012, pp. 64–66]. This involved classifying the $p_{0.05}$ observations into (e.g. 14) groups ( $k$ ) based on the $R \times C$ table dimensions and whether or not $n_{. .}$ was even or odd. The quantile regression model coefficients were then re-estimated by removing the data for one group $k$ at a time (i.e. $b_{- k}$ ). Pseudo estimates of the model coefficients ( $b_{k}$ ) were then determined based on $b_{- k} - b$ . The 95% confidence intervals were then estimated based on variation in $b_{k}$ . The rationale for the jackknife grouping is to address potential table dimension sampling effects, and potential effects of “even-size samples” mentioned in [Goodman and Kruskal, Citation1979, p. 363].

3. Results

Single index trends and envelope model search results for the 5th percentile Fisher-Freeman-Halton exact, Pearson chi-square, and Goodman-Kruskal gamma test p-values are presented in separate subsections.

3.1. Fifth percentile Fisher-Freeman-Halton exact test P-values

3.1.1. Single Index Trends

The results in compare the 5th percentile p-values from the Monte Carlo simulations of the Fisher and Fisher-Freeman-Halton exact tests to $| S_{r c} |$ , $| N_{r c} |$ , and $min (p_{F i s h e r})$ respectively. The results for four different $R \times C$ table dimensions are depicted by different symbol types as indicated in the legend. The number of observed $p_{0.05}$ values for each table dimension are also indicated in the legend. The total number of $p_{0.05}$ observations in each figure is 14,154,495 .Footnote⁶ Each of the $p_{0.05}$ observations is based on the p-values from 100,000 randomly sampled tables. Therefore, the results in these figures are based on 1,415,449,500,000 randomly sampled tables. The horizontal dotted lines indicate $1 / 2$ , 1, and 2 times the ideal 5th percentile p-value value (0.05) for reference purposes. Sub also includes a reference curve for $0.05 + 1 / | N_{r c} |$ , which represents a theoretical maximum deviation from the ideal p-value assuming that the $| N_{r c} |$ possible unique p-values are equally spaced.

Figure 2. Fifth percentile Fisher-Freeman-Halton p-values versus $| S_{r c} |$ , $| N_{r c} |$ , and $min (p_{F i s h e r})$

This figure shows that all of the 5th percentile p-values are greater than or equal to 0.05. This indicates that the Fisher-Freeman-Halton exact test tends to be conservative, a result which has been observed by others (e.g. [Agresti, Citation2002, pp. 93–94]).

The vertical dotted lines indicate the index values for all $2 \times 3$ tables with row margins = [39,11] and column margins = [35,9,6] (e.g. ). These subfigures indicate that $p_{0.05}$ tends to be between 0.05 and 0.10 for these index values, which is desirable from a test accuracy viewpoint.Footnote⁷

also shows estimated curves for the candidate $U B$ , median, and $L B$ models based on EquationEquation (2)(2) $Q = π (g (p_{i d e a l}) + s exp (λ_{0} + \sum_{i} λ_{i} log (x_{i}))) + \isin$ (2) . The equations for the curves are also indicated in the legend. The models were fit to 13,696,372 observed $p_{0.05}$ values for $R \times C$ tables with $min (p_{F i s h e r}) \leq 0.05$ . The other 458,123 observations with $min (p_{F i s h e r}) > 0.05$ were excluded because they were outside the 0.05 p-value domain-of-validity for the Fisher-Freeman-Halton exact test. The excluded observations included 475 $p_{0.05}$ values for $R \times C$ tables with $| N_{r c} | = 1$ .

These subfigures indicate the observed $p_{0.05}$ values tend to converge towards the ideal 5th percentile p-value as $| S_{r c} |$ or $| N_{r c} |$ increases. This result was anticipated based on the example results in Sub , which indicated that the deviation in the observed p-values from the ideal p-values tends to be inversely proportional to the number of possible unique test p-values. The estimated $U B$ curve in Sub tends to follow the theoretical $0.05 + 1 / | N_{r c} |$ curve, allowing for differences due to the assumed logistic form of the $U B$ model and variation in unique probability spacing. The range of data and the estimated $L B$ curve indicates that the lower bounds for $p_{0.05}$ are approximately equal to the ideal 5th percentile p-value, which is expected for the exact tests. This result was also anticipated based on Sub , which showed that the observed p-values were greater than or equal to the ideal p-values.

Sub is similar to Sub with the following exceptions. The candidate index is $min (p_{F i s h e r})$ . The horizontal axis scale has been transformed to correspond to the vertical axis because they are both probabilities. Sub shows a large number of observations with $min (p_{F i s h e r}) > 0.05$ , and therefore outside the test 0.05 p-value domain-of-validity. The observations outside the 0.05 p-value domain-of-validity were not used to fit the candidate envelope models.

Results for the candidate single index models are summarized in Table C.1 in Appendix C. The results are rank ordered from the smallest (most desirable) combined $\overset{ˉ}{f}$ value to the largest. These results indicate that the single index model based on $| N_{r c} |$ has the best fit, with ${\hat{λ}}_{1} =$ −0.799 $\pm$ 0.357 for the $U B$ curve.

3.1.2. Envelope model search results

Envelope models based on EquationEquation (1)(1) $Q = π (g (p_{i d e a l}) + \sum_{i} \sum_{j} β_{i, j} x_{i}^{q_{i, j}}) + \isin$ (1) with one or two indices were also investigated. The candidate $q_{1}$ for the candidate $| S_{r c} |$ and $| N_{r c} |$ indices included [−0.75] and [−0.8] based on the estimated ${\hat{λ}}_{1}$ results in Table C.1, plus [−0.5], [−1], [−1,-2], and [−1,-2,-3]. It was assumed that $q_{2} = [1]$ for the candidate $min (p_{F i s h e r})$ index because the model envelope was assumed to converge to the ideal value if $min (p_{F i s h e r}) \to 0$ . The candidate model indices and the resulting $\overset{ˉ}{f}$ figure-of-merit values are summarized in Table C.2. The tabulated results are rank ordered from the smallest (most desirable) combined $\overset{ˉ}{f}$ value to the largest.

These results, like Table C.1, indicate that envelope models based on $| N_{r c} |$ tend to fit the range of observed 5th percentile Fisher-Freeman-Halton p-values better than corresponding models based on $| S_{r c} |$ . This was expected because the magnitude of the discrete staircase effect indicated in is more closely related to the number of unique nominal probabilities than the number of unique tables.

The results in Table C.2 indicate that the candidate envelope model based on $x_{1} = | N_{r c} |$ with $q_{1} = [- 1, - 2, - 3]$ and $min (p_{F i s h e r})$ has the best overall fit to the observed $p_{0.05}$ values with a combined $\overset{ˉ}{f} =$ 0.000712 figure-of-merit. However, only one of the four estimated $U B$ model coefficients was statistically significant at the 0.05 level. Therefore, the model with four coefficients may be overparameterized. We would expect the coefficients for the $L B$ model to be small and statistically significant because the lower bounds for the exact test should be approximately equal to the expected p-value.

The candidate envelope model with the smallest combined $\overset{ˉ}{f}$ and all statistically significant $U B$ model coefficients is based on $x_{1} = | N_{r c} |$ with $q_{1} = [- 0.8]$ and $x_{2} = min (p_{F i s h e r})$ . The estimated coefficients for this model are listed in Table D.2. The combined $\overset{ˉ}{f}$ figure-of-merit for this model is 0.000745, a 5% increase compared to the best overall fit in Table C.2. However, the estimated coefficient for $min (p_{F i s h e r})$ in the $U B$ model is negative. This result indicates that the upper bounds for $p_{0.05}$ tends to increase as $min (p_{F i s h e r}) \to 0$ , which is counterintuitive. These results appear to be associated with $R \times C$ tables that have small $min (p_{F i s h e r})$ values and large 5th percentile p-values. This can occur, for a given set of row and column margin totals, if there is a $R \times C$ table with Fisher-Freeman-Halton p-value near the critical p-value and the next larger p-value is much larger than the critical p-value, typically due to a tie or near tie. For example, given the set of $2 \times 3$ tables with $r = [11, 64]$ and $c = [2, 8, 65]$ , $min (p_{F i s h e r}) = 1.3 E - 11$ and the 5th percentile p-value is 0.164648 (see Figure D.1). This is because the p-value for the table with [1,3,7; 1,5,58] cell counts is 0.049238, which is slightly less than 0.05, and the next largest p-values are 0.164648 for tables with [1,2,8; 1,6,57] and [0,3,8; 2,5,57] cell counts.

If the $min (p_{F i s h e r})$ term is dropped, then the envelope model with the next smallest combined $\overset{ˉ}{f}$ is based on $x_{1} = | N_{r c} |$ with $q_{1} = [- 0.8]$ . The estimated coefficients for this model are listed in , and the estimated $U B$ coefficient for the $| N_{r c} |$ term is statistically significant. The combined $\overset{ˉ}{f}$ figure-of-merit for this envelope model is 0.000753, which is a 1% increase compared to the aforementioned model with the $min (p_{F i s h e r})$ term. The $| N_{r c} |$ term for the $L B$ model is small and not statistically significant, as would be expected for an exact test.

Table 5. Fifth percentile Fisher-Freeman-Halton p-value envelope model fit to $| N_{r c} |$

Display Table

According to the estimated model coefficients in the 95% interval for $p_{0.05}$ would exceed 0.10 if $| N_{r c} | < 20.955$ .Footnote⁸ Therefore, a possible p-value accuracy criterion for the Fisher-Freeman-Halton exact test is $| N_{r c} | \geq 20.955$ .

3.2. Fifth percentile Pearson Chi-square test P-values

3.2.1. Single index trends

The results in compare the 5th percentile p-values from the Monte Carlo simulations of the Pearson chi-square test to $| N_{r c} |$ , ${\hat{n}}_{m i n}$ , $n_{. .}$ , and $min (p_{P e a r s o n})$ respectively. The format of this figure is similar to . The results for seven different $R \times C$ table dimensions are depicted by different symbol types as indicated in the legend. The total number of $p_{0.05}$ observations in each figure is 24,578,997 . These observations are based on 2,457,899,700,000 randomly sampled tables.

Figure 3. Fifth percentile Pearson p-values versus $|N_{r c}|$ , ${\hat{n}}_{min}$ , $n . .$ , and $min (P P e a r s o n)$

Figure 3. Fifth percentile Pearson p-values versus Nrc, nˆmin, n.., and minPPearson

The results in Sub indicate that the observed $p_{0.05}$ values tend to converge towards the ideal 5th percentile p-value as $| N_{r c} |$ or ${\hat{n}}_{m i n}$ increases, but not when $n_{. .}$ increases. The converging ${\hat{n}}_{m i n}$ results in Sub are somewhat consistent with traditional criteria that the expected cell counts should be greater than some minimum value (e.g. 5, 10, or 20). The results in Sub also show that small $| N_{r c} |$ values are undesirable, which was also expected based on the results in Sub and similar results in Sub for the Fisher-Freeman-Halton test. The lack of convergence for $n_{. .}$ in Sub may be attributed to small $| N_{r c} |$ or ${\hat{n}}_{m i n}$ values.

The results in Sub show that the Pearson chi-square test p-values are outside the test p-value domain-of-validity if $min (p_{P e a r s o n})$ is greater than the ideal value (e.g. 0.05). Therefore 46,578 observations with $min (p_{P e a r s o n}) > 0.05$ were excluded from the envelope model quantile regression analysis.

The results in Sub also indicate an abrupt change near ${\hat{n}}_{m i n} = 0.05$ . This change was further explored in Figure D.5, which shows that the differences between the 5th percentile Pearson and Fisher-Freeman-Halton p-values tends to peak near ${\hat{n}}_{m i n} = 0.5$ . Therefore, all of the observations with ${\hat{n}}_{m i n} \leq 0.5$ were also excluded from the quantile regression analysis of the Pearson chi-square test p-values because these observations were also expected a priori to be outside the Pearson domain-of-validity and could adversely affect the overall fit to the assumed candidate model form. The estimated models for the Pearson chi-square test p-value envelope are therefore based on a fit to the remaining 10,249,393 observations, as depicted in Figure D.6. The results SubFigure D.6a more clearly shows the convergence to the ideal p-value as $| N_{r c} |$ increases.

and D.6 also show the estimated curves for the candidate $U B$ , median, and $L B$ models based on EquationEquation (2)(2) $Q = π (g (p_{i d e a l}) + s exp (λ_{0} + \sum_{i} λ_{i} log (x_{i}))) + \isin$ (2) . The estimated $λ_{1}$ and $\overset{ˉ}{f}$ figure-of-merit values for these and other candidate single index models are summarized in Table C.3.

3.2.2. Envelope model search results

The convergence of the observed Pearson chi-square test $p_{0.05}$ values towards the ideal 5th percentile p-value as $| N_{r c} |$ or ${\hat{n}}_{m i n}$ increases indicates the need to determine the independent effects of discreteness and approximation on the accuracy of the test. This was accomplished by considering envelope models based on EquationEquation (1)(1) $Q = π (g (p_{i d e a l}) + \sum_{i} \sum_{j} β_{i, j} x_{i}^{q_{i, j}}) + \isin$ (1) with up to four candidate indices to address different effects. The assumed candidate powers for the candidate indices based on $| S_{r c} |$ or $| N_{r c} |$ , ${\hat{n}}_{m i n}$ , and $n_{. .}$ where $q_{1} = [- 0.5]$ and $q_{2} = q_{3} = [- 1.0]$ . The rationale for these negative powers is that the range of observed $p_{0.05}$ values is assumed to converge to the ideal 5th percentile p-value as these indices increase towards infinity, and the −0.5 and −1.0 values represent the range of ${\hat{λ}}_{1}$ results in Table C.3. The assumed candidate power for the candidate $min (p_{P e a r s o n})$ index was $q_{4} = [1]$ .

The candidate model indices and the resulting $\overset{ˉ}{f}$ figure-of-merit values for the Pearson chi-square test are summarized in Table C.4. The results are rank ordered from the smallest (most desirable) combined $\overset{ˉ}{f}$ value to the largest. These results indicate that the model with $x_{1} = | N_{r c} |$ , $x_{2} = \sum_{m i n}^{ν} \hat{n}$ , $x_{3} = n_{. .} / ν$ , and $x_{4} = min (p_{P e a r s o n})$ had the smallest combined $\overset{ˉ}{f} = 0.000600$ . However, the estimated $L B$ and $U B$ model coefficients for the $n_{. .} / ν$ and $min (p_{P e a r s o n})$ terms were not statistically significant; the term based on $n_{. .}$ was not consistent in the top ranked models; and the $min (p_{P e a r s o n})$ term did not improve the fit very much. This suggests that these three-term and four-term models may be overparameterized, which is undesirable. Dropping the terms based on $n_{. .}$ and $min (p_{P e a r s o n})$ from the model search increases the combined $\overset{ˉ}{f}$ slightly from $0.000600$ to $0.000617$ . The estimated model coefficients for this top ranked one-term or two-term model are listed in .

Table 6. Fifth percentile Pearson p-value envelope model fit to $| N_{r c} |$ and $(\sum_{m i n}^{ν} \hat{n})$

Display Table

Note that the estimated $L B$ model coefficient for the $\sum_{m i n}^{ν} \hat{n}$ term in has a small positive value that is not statistically significant. A positive value indicates that the lower bounds tends to increase for small $\sum_{m i n}^{ν} \hat{n}$ values, which is counter intuitive. If a negative sign constraint is imposed on the $L B$ model coefficients then the $\sum_{m i n}^{ν} \hat{n}$ term drops out, the $| N_{r c} |^{- 0.5}$ coefficient for the $L B$ model changes from $- 2.1022 \pm 0.65233$ to $- 1.9942 \pm 0.39061$ , and $\overset{ˉ}{f}$ increases slightly from $0.000494$ to $0.000495$ . This result appears to be consistent with the results in Figure D.6, which indicate that $| N_{r c} |$ has a stronger effect on the convergence then ${\hat{n}}_{m i n}$ for ${\hat{n}}_{m i n} > 0.5$ . Dropping the $\sum_{m i n}^{ν} \hat{n}$ term from the $U B$ model is not recommended because $\overset{ˉ}{f}$ increases substantially from 0.000741 to 0.001110 as indicated in Table C.4.

Scatter plots of the 5th percentile Pearson p-values versus the $U B$ and $L B$ models based on $| N_{r c} |$ and $\sum_{m i n}^{ν} \hat{n}$ are depicted in . The diagonal dotted lines indicate the p-values corresponding to the $U B$ and $L B$ models. Therefore, 2.5% of the observed $p_{0.05}$ values in Sub should be above the diagonal line, and 2.5% of the $p_{0.05}$ values in Sub should be below the diagonal line. The horizontal and vertical dotted lines indicate $1 / 2$ , 1, and 2 times the ideal 5th percentile p-value for reference purposes.

Figure 4. Fifth percentile Pearson p-values versus functions of $|N_{r c}|$ and $Σ_{min}^{v} \hat{n}$

Figure 4. Fifth percentile Pearson p-values versus functions of Nrc and Σminvnˆ

depicts contours of the estimated lower and upper bound models for the 5th percentile Pearson p-value versus $| N_{r c} |$ and $\sum_{m i n}^{ν} \hat{n}$ . The lower bounds is depicted in Sub . The small $♢$ symbols are a locus of points that satisfy $L B = 0.025$ and the adjacent shaded region depicted by the $\circ$ symbols indicates where $L B < 0.025$ . The asymptotic values for the $L B = 0.025$ contour are $| N_{r c} | \to 8.545$ when $\sum_{m i n}^{ν} \hat{n} \to \infty$ , and $\sum_{m i n}^{ν} \hat{n} \to$ undefined when $| N_{r c} | \to \infty$ because the model coefficient is non-negative. If the $\sum_{m i n}^{ν} \hat{n}$ term is removed from this $L B$ model then the $L B = 0.025$ contour is simply $| N_{r c} | =$ 7.690 . The $\nabla$ and $Δ$ symbols indicate the relative distributions of observed $p_{0.05}$ values $\geq 0.025$ and $< 0.025$ . The upper bounds is depicted in Sub in a similar manner. The small $♢$ symbols are a locus of points that satisfy $U B = 0.10$ and the shaded region indicates where $U B > 0.10$ . The asymptotic values for the $U B = 0.10$ contour are $| N_{r c} | \to 4.620$ and $\sum_{m i n}^{ν} \hat{n} \to 1.504$ when the other index goes to infinity. These results indicate that $p_{0.05}$ tends to be between 0.025 and 0.10, which is desirable, if the $| N_{r c} |$ and $\sum_{m i n}^{ν} \hat{n}$ coordinate is in the unshaded region of both subfigures. The $\sum_{m i n}^{ν} \hat{n} > 1.504$ criterion when $| N_{r c} | \to \infty$ is consistent with Cochran’s statement that “there is little disturbance to the 5% level when a single expectation is as low as $1 / 2$ ” [Cochran, Citation1952, p. 329].

Figure 5. Fifth percentile Pearson p-value envelope model contours

The $◯$ symbol in indicates the coordinates for $2 \times 3$ tables with $r = [39, 11]$ and $c = [35, 9, 6]$ (e.g. ). This symbol is located in the unshaded region of both subfigures. This indicates the 5th percentile p-values for randomly sampled tables with these margin totals tend to be between 0.025 and 0.10, which is desirable from a test accuracy viewpoint.

The same $L B$ and $U B$ models can also be applied to more stringent limits such as 0.04 to 0.06. The value for the $L B = 0.04$ contour would be $| N_{r c} | =$ 72.9 if the $\sum_{m i n}^{ν} \hat{n}$ term is dropped. The asymptotic values for the $U B = 0.06$ contour would be $| N_{r c} | \to 69.3$ and $\sum_{m i n}^{ν} \hat{n} \to 5.83$ when the other index goes to infinity. This more stringent $U B$ criteria is within the range of the ${\hat{n}}_{m i n} \geq$ 5 or 10 criterion recommended by others.

3.3. Fifth percentile Goodman-Kruskal gamma test P-values

3.3.1. Single Index trends

The results in compare the 5th percentile p-values from the Monte Carlo simulations of the Goodman–Kruskal Gamma test based on $A S E_{0}$ to $| O_{r c} |$ , ${\hat{n}}_{m i n}$ , $n_{. .}$ , and $min (p_{G K_{0}})$ respectively. The format of this figure is also similar to . The total number of $p_{0.05}$ observations was 24,578,997.

Figure 6. Fifth percentile Goodman-Kruskal p-values (based on ASE0) versus $|O_{r c}|$ , ${\hat{n}}_{min}$ , $n . .$ , $min (p_{G K 0})$

Figure 6. Fifth percentile Goodman-Kruskal p-values (based on ASE0) versusOrc, nˆmin,n.., minpGK0

Like the results for the Pearson chi-square test, the results in Sub for the Goodman-Kruskal gamma test indicate that the observed $p_{0.05}$ values tend to converge towards the ideal 5th percentile p-value as ${\hat{n}}_{m i n}$ increases, but not when $n_{. .}$ increases. The results in Sub indicate that the observed $p_{0.05}$ values appear to converge to values less than the ideal 5th percentile p-value as $| O_{r c} |$ increases, indicating possible liberal bias.

The results in Sub show that the Goodman-Kruskal gamma test p-values are outside the test p-value domain-of-validity if $min (p_{G K_{0}})$ is greater than the ideal value (e.g. 0.05). Therefore 152,072 observations with $min (p_{G K_{0}}) > 0.05$ were excluded from the envelope model quantile regression analysis.

In addition, it was observed in Figure D.4 that the differences between the 5th percentile p-values from two different versions of the Goodman-Kruskal gamma test are the largest when ${\hat{n}}_{m i n}$ is about 0.5. This result is similar to, but more pronounced than, the differences between the Fisher-Freeman-Halton and Pearson $p_{0.05}$ values observed in Figure D.5. Therefore, all of the observations with ${\hat{n}}_{m i n} \leq 0.5$ were excluded from the quantile regression analysis of the Goodman-Kruskal gamma test p-values because they could adversely affect the overall fit to the assumed candidate model form. The estimated models for the Goodman-Kruskal gamma test p-value envelope are therefore based on a fit to the remaining 10,143,899 observations with ${\hat{n}}_{m i n} > 0.5$ , as depicted in Figure D.7. The results in SubFigure D.7a also more clearly shows convergence, but not to the ideal p-value, as $| O_{r c} |$ increases.

3.3.2. Envelope model search results

Like the Pearson chi-square test, the convergence of the observed Goodman-Kruskal gamma test $p_{0.05}$ values as $| O_{r c} |$ or ${\hat{n}}_{m i n}$ increases also indicates the need to elucidate the independent effects of discreteness and approximation on the accuracy of the test. This was also accomplished by considering envelope models based on EquationEquation (1)(1) $Q = π (g (p_{i d e a l}) + \sum_{i} \sum_{j} β_{i, j} x_{i}^{q_{i, j}}) + \isin$ (1) with up to four candidate indices. The assumed candidate $q_{i}$ were the same as for the Pearson chi-square test envelope models, based on similar rationale.

The candidate model indices and the resulting $\overset{ˉ}{f}$ figure-of-merit values for the Goodman-Kruskal gamma test are summarized in Table C.6. The results in this table are also rank ordered from the smallest (most desirable) combined $\overset{ˉ}{f}$ value to the largest. These results indicate that of the candidate model with $x_{1} = | O_{r c} |$ , $x_{2} = ν \sum_{m i n}^{ν} \hat{n}$ , $x_{3} = n_{. .} / R C$ , and $x_{4} = min (p_{G K_{0}})$ had the smallest combined $\overset{ˉ}{f} = 0.000537$ . However, the top ranked model with only three indices fits almost as well. This model drops the $min (p_{G K})$ index and the combined $\overset{ˉ}{f}$ value increases slightly from 0.000537 to 0.000547 . The estimated model coefficients for this envelope model are listed in ().Footnote⁹

Table 7. Fifth percentile Goodman-Kruskal ( $A S E_{0}$ ) p-value envelope model fit to $| O_{r c} |$ , $(ν \sum_{m i n}^{ν} \hat{n})$ , and $(n_{. .} / R C)$

Display Table

Scatter plots of the 5th percentile Goodman-Kruskal p-values (based on $A S E_{0}$ ) versus the $U B$ and $L B$ models in are depicted in . The format and interpretation of these plots are similar to

Figure 7. Fifth percentile Goodman-Kruskal p-values (based on ASE0) versus $|O_{r c}|$ , $v Σ_{min}^{v} \hat{n}$ , and $n . . / R C$

Figure 7. Fifth percentile Goodman-Kruskal p-values (based on ASE0) versus Orc, vΣminvnˆ, and n../RC

. Fifth percentile Goodman-Kruskal p-values (based on $A S E_{0}$ ) versus functions of $| O_{r c} |$ , $ν \sum_{m i n}^{ν} \hat{n}$ , and $n_{. .} / R C$ .

Contours of the estimated lower and upper bound models for the 5th percentile Goodman-Kruskal p-value versus $| O_{r c} |$ , $ν \sum_{m i n}^{ν} \hat{n}$ , and $n_{. .} / R C$ are depicted in Figure D.8 in Appendix D. These plots graphically illustrate that $p_{0.05}$ tends to be between 0.025 and 0.10, which is desirable, if the $| O_{r c} |$ , $ν \sum_{m i n}^{ν} \hat{n}$ , and $n_{. .} / R C$ coordinate is in the unshaded region of all of the subfigures. The $◯$ symbols, which represent the indices for the example in , are outside this desired (i.e. unshaded) region. This could explain the anomalous Goodman-Kruskal gamma test result listed in .

4. Limitations

The results herein are based on Monte Carlo analysis of $R \times C$ contingency tables within the simulation domain listed in . It is assumed that the results for tables with $R > C$ are equivalent to results for tables with $R < C$ due to the symmetry of the hypothesis tests. It is also assumed that the $R \times C$ tables are randomly sampled with fixed row and column margin totals. The observed 5th percentile p-value for each row and column margin total combination is based a random sample of 100,000 tables. The p-value accuracy results are limited to tables with row and column margin totals within the p-value domain-of-validity. The results for the Fisher-Freeman-Halton exact test envelope models are also limited to tables with $| N_{r c} | \geq 2$ ; and the Pearson chi-square test and Goodman-Kruskal gamma test envelope models are limited to tables with ${\hat{n}}_{m i n} > 0.5$ . The estimated envelope models are based on the assumption that all of the 5th percentile p-values for each set of fixed row and column margin totals within the simulation domain are equally weighted. The results are based on a critical test p-value of 0.05, but the methodology and results could be extended to other critical p-values.

5. Recommendations

The following criteria to determine whether or not a hypothesis test should be used for a given $R \times C$ contingency table are recommended based on the results in the previous section. These criteria comprise two parts and are based on the table dimensions and margin totals.

First, the critical p-value must be within the p-value domain-of-validity of the hypothesis test. This can be evaluated by calculating the minimum possible hypothesis test p-value using the most extreme table with the same row and column margin totals. If the minimum possible p-value is greater than or equal to the critical p-value then the test is not reliable and should not be used. It is assumed that the maximum possible p-value is always greater than the critical p-value and does not need to be evaluated.

If the p-value domain-of-validity criterion is met, then the p-value accuracy criteria is evaluated. Ideally the $α$ ’th percentile p-value for a hypothesis test would be $α$ /100% if the null hypothesis is true. We propose that the lower and upper bounds for the percentile p-value corresponding to the critical p-value should be within a factor of two of the critical p-value. For example, if the critical p-value is 0.05, then the $L B$ and $U B$ for the 5th percentile p-value should be between 0.025 and 0.10.Footnote¹⁰ If not then the test p-value may be inaccurate (i.e. either too liberal or too conservative) and unreliable. This involves calculating the $L B$ and $U B$ values using EquationEquation (1)(1) $Q = π (g (p_{i d e a l}) + \sum_{i} \sum_{j} β_{i, j} x_{i}^{q_{i, j}}) + \isin$ (1) . Indices and estimated coefficients for the Fisher-Freeman-Halton exact test for tables up to $3 \times 3$ and $2 \times 4$ are provided in for the 5th percentile p-value. Indices and estimated coefficients for the Pearson chi-square test and the $A S E_{0}$ version of the Goodman-Kruskal gamma test for tables up to $4 \times 4$ and $2 \times 5$ are provided in for the 5th percentile p-value, provided that ${\hat{n}}_{m i n} > 0.5$ .Footnote¹¹ Example envelope model calculations are in Appendix E. R run scripts to evaluate the recommended criteria are also provided in the supplemental material (see Appendix G).

5.1. Fisher-Freeman-Halton exact test accuracy criteria

The p-value accuracy criterion for the Fisher-Freeman-Halton test with a critical p-value of 0.05 is satisfied if $| N_{r c} | \geq 20.955$ . The $| N_{r c} |$ value can be calculated by the algorithm in Appendix B.2.

5.2. Pearson Chi-Square test accuracy criteria

The p-value accuracy criteria for the Pearson chi-square test depend on both $| N_{r c} |$ and $\sum_{m i n}^{ν} \hat{n}$ . This criteria can be graphically evaluated by referring to . If the $| N_{r c} |$ versus $\sum_{m i n}^{ν} \hat{n}$ coordinate is in one or both of shaded regions in then the test p-value may be either too liberal or too conservative. If the coordinate is in the unshaded region of both subfigures then the recommended criteria for the Pearson chi-square test are satisfied.

The p-value accuracy criteria for the Pearson chi-square test can also be numerically evaluated using EquationEquation (1)(1) $Q = π (g (p_{i d e a l}) + \sum_{i} \sum_{j} β_{i, j} x_{i}^{q_{i, j}}) + \isin$ (1) and the model coefficients in . The $L B$ and $U B$ values should be within a factor of two of the ideal value (0.05). Otherwise the test could be too liberal if $L B < 0.025$ or too conservative if $U B > 0.10$ .

An example calculation for the 5th percentile Pearson chi-square test p-value envelope is in Table E.2. This example indicates that the 5th percentile p-value for randomly sampled $2 \times 3$ tables with $r = [39, 11]$ and $c = [35, 9, 6]$ is between 0.039402 and 0.083430 for 95% of the tables. This range is between 0.025 and 0.10, which is desirable from a test accuracy viewpoint.

5.3. Goodman–Kruskal Gamma test accuracy criteria

The p-value accuracy criteria for the Goodman-Kruskal gamma test depend on $| O_{r c} |$ , $ν \sum_{m i n}^{ν} \hat{n}$ , and $n_{. .} / R C$ . This criteria require the calculation of $L B$ and $U B$ to determine if the range is within a factor of two of the ideal value (e.g. 0.05). The $L B$ and $U B$ values are calculated using EquationEquation (1)(1) $Q = π (g (p_{i d e a l}) + \sum_{i} \sum_{j} β_{i, j} x_{i}^{q_{i, j}}) + \isin$ (1) and the model coefficients in . The $| O_{r c} |$ value can be calculated by the algorithm in Appendix B.3.

An example calculation for the 5th percentile Goodman-Kruskal gamma test p-value envelope is in Table E.3. This example indicates that the 5th percentile p-value for randomly sampled $2 \times 3$ tables with $r = [39, 11]$ and $c = [35, 9, 6]$ is between 0.015660 and 0.057842 for 95% of the tables. This range is not within a factor of two of the ideal value (0.05), which is undesirable from a test accuracy viewpoint.

Therefore, these recommended criteria indicate that the example $2 \times 3$ table listed in is within the domain-of-validity of the Fisher-Freeman-Halton exact test and the Pearson chi-square test, but not within domain-of-validity of the Goodman-Kruskal gamma test. This could explain the apparently inconsistent hypothesis test results listed in .

6. Conclusions

The domain-of-validity and accuracy of the Fisher-Freeman-Halton exact, Pearson chi-square, and Goodman-Kruskal gamma tests for small $R \times C$ contingency tables were investigated. The Fisher-Freeman-Halton exact test and the Pearson chi-square test consider whether or not the frequencies in a contingency table have equal underlying proportions or independence. The Goodman-Kruskal gamma test considers whether or not a measure of association based on the frequencies in the table is statistically significant.

The investigation involved Monte Carlo simulations of randomly sampled contingency tables up to $4 \times 4$ and $2 \times 5$ in which the row and column variables were independent (i.e. equal underlying proportions). One hundred thousand tables were randomly sampled from each set of fixed row and column margin totals, for all margin total sets up to a pre-determined maximum table total (e.g. 200 for $2 \times 2$ tables). P-values for each test were calculated for each sampled table, and the 5th percentile p-values were determined for each set of row and column margin totals.

It was observed that there were some sets of row and column margin totals for which the minimum possible p-value for a given hypothesis test was greater than likely critical p-values (e.g. 0.05). If this condition occurs than the $R \times C$ table is outside the p-value domain-of-validity of the hypothesis test and the test should not be used.

Envelope models were then estimated using quantile regressions to describe the range of the 5th percentile p-values observed in the Monte Carlo simulations. Candidate indices related to the discreteness and approximation accuracy of the p-value distributions, the total table counts, and the minimum possible p-value were considered. The indices can be calculated from $R$ , $C$ , and the row and column margin totals using the equations and algorithms described herein. The 2.5% and 97.5% quantiles were used to describe the lower and upper bounds of the range in the 5th percentile p-values. Data outside the p-value domain-of-validity were not used for these regressions.

The results for the Fisher-Freeman-Halton exact test indicated that this test tends to be conservative (i.e. the 5th percentile p-value tends to be greater than 0.05), and that the p-value accuracy depends on the number of possible $R \times C$ tables with unique nominal probabilities ( $| N_{r c} |$ ). The lower bounds is approximately equal to the ideal 5th percentile p-value provided that $| N_{r c} | > 1$ . The upper bounds is within a factor of two of the ideal 5th percentile p-value if $| N_{r c} | \geq 20.955$ .

The results for the Pearson chi-square test indicated that this test tends to be unbiased (i.e. the median 5th percentile p-value tends to be about 0.05), and that the p-value accuracy depends on $| N_{r c} |$ and the sum of the $ν$ smallest expected cell counts ( $\sum_{m i n}^{ν} \hat{n}$ ), where $ν$ is the number of degrees-of-freedom for the chi-square statistic. Equations and model coefficients are provided in to estimate the range of 5th percentile chi-square test p-values based on $R$ , $C$ , and the row and column margin totals. The values for $| N_{r c} |$ and $\sum_{m i n}^{ν} \hat{n}$ for which the lower and upper bounds for the 5th percentile p-values are within 0.025 and 0.10 are graphically depicted in .

The results for the Goodman-Kruskal gamma test indicated that the p-value accuracy depends on the number of possible $R \times C$ tables with unique ordinal probabilities ( $| O_{r c} |$ ), $ν \sum_{m i n}^{ν} \hat{n}$ , and $n_{. .} / R C$ . Equations and model coefficients are provided in to estimate the range of 5th percentile gamma test p-values based on $R$ , $C$ , and the row and column margin totals.

We propose that the lower and upper bounds for the percentile p-value corresponding to the critical p-value should be within a factor of two of the critical p-value. Therefore, if the critical p-value is 0.05, then the $L B$ and $U B$ for the 5th percentile p-value should be between 0.025 and 0.10. The accuracy of the 5th percentile Fisher-Freeman-Halton p-value is with this range if $| N_{r c} | \geq 20.955$ . Other levels of p-value accuracy can also be specified, such as between 0.04 and 0.06. The accuracy criteria for the Pearson chi-square test depends on $| N_{r c} |$ and $\sum_{m i n}^{ν} \hat{n}$ , and the Goodman-Kruskal gamma test depends on $| O_{r c} |$ , $ν \sum_{m i n}^{ν} \hat{n}$ , and $n_{. .} / R C$ .

An example application of the proposed criteria to a $2 \times 3$ contingency table was also given. These criteria indicated that the example $2 \times 3$ table is within the domain-of-validity of two of the three hypothesis tests investigated.

The criteria presented herein is specific to the 0.05 critical p-value. Criteria specific to other critical p-values (e.g. 0.01) or other hypothesis tests could also be determined using the methods described herein.

Supplemental material

Supplemental Material

Download PDF (19.9 MB)

Disclosure statement

No potential conflict of interest was reported by the authors.

Supplementary Material

Supplemental data for this article can be accessed here.

Additional information

Funding

This work was supported by American Honda Motor Co., Inc. [NA];

Notes on contributors

R. M. Van Auken

R. M. Van Auken and S. A. Kebschull have been involved in various multi-disciplinary applied research projects primarily for the ground vehicle industry. Research activities involved modeling, simulation, testing, and analysis of human-vehicle-environmental systems; including vehicle dynamics, manual control, biomechanics, accident data analysis, and risk/benefit analysis. They have also been involved in regulation, testing and standards development. Some of these activities involved analysis of contingency tables using hypothesis tests such as the Fisher exact, Pearson chi-square, and Goodman-Kruskal gamma tests. The results of the research reported herein provided useful insight and understanding for suitability and interpretation of these hypothesis test results.

Notes

1. The results in this paper are specific to a critical p-value of 0.05 because this value is widely used. See [Box, Hunter, and Hunter, 1978, p. 109] and (Wasserstein & Lazar, Citation2016). Other critical p-values such as 0.01 have also been used. The methods used herein could be extended to these other critical p-values in the future.

2. Norušis states `[i]f you reject the null hypothesis that two variables are independent, you may want to describe the nature and strength of the relationship between the two variables’ [Norušis, 2008, p. 172]. This suggests that association is only meaningful if the null hypothesis for independence can be rejected.

3. In our notation

k = R C

and

n = n_{. .}

4. See footnote 1.

5. This condition can occur in $2 \times 2$ tables even for large $n_{. .}$ if one of the row margin totals is equal to 1 and the column margin totals are equal. For example, if $r = [1, 199]$ and $c = [100, 100]$ .

6. Many of the 14,154,495 symbols in Figure 2 mask other symbols due to the symbol size and drawing order. The symbols were drawn in random order in order to depict the relative distribution of the table dimensions. The absolute distribution of the observed p-values is not depicted because of masking.

7. The criteria for a desirable range of $p_{0.05}$ values is arbitrary but directly relates to test p-value accuracy. Cochran [Cochran, 1952, pp. 328–329] suggested a range between 0.04 and 0.06 for a critical p-value of 0.05. Cochran acknowledged that these `limits are, of course, arbitrary; some would be content with less conservative limits.’ We are using a range of 0.025 to 0.10, which corresponds to a relative factor of 2, and is less conservative then Cochran’s suggested limits.

8. If we use the estimated alternative model coefficients in Table D.2 then the 95% interval for $p_{0.05}$ the Fisher-Freeman-Halton test would exceed 0.10 if $| N_{r c} | < 21.738$ and $min (p_{F i s h e r}) = 0$ .

9. The envelope model search results in Tables C.5 and C.6 indicate that models based on $| S_{r c} |$ tend to fit the distribution of 5th percentile Goodman-Kruskal gamma test p-values almost as well as $| O_{r c} |$ . The $| S_{r c} |$ index requires less effort to calculate than $| O_{r c} |$ , which is desirable. Estimated coefficients for an alternative Goodman-Kruskal gamma test envelope model based on $| S_{r c} |$ are listed in Table D.3 in Appendix D.5.

10. Cochran [Cochran, 1952, pp.328–329] suggested a range between 0.04 and 0.06 for a critical p-value of 0.05, and a range between 0.007 and 0.015 for a critical p-value of 0.01. Cochran acknowledged that these `limits are, of course, arbitrary; some would be content with less conservative limits.’ Our proposed limits are also arbitrary and less conservative than Cochran’s limits.

11. We can reasonably assume that the Pearson chi-square and Goodman-Kruskal gamma tests are not valid if ${\hat{n}}_{m i n} \leq 0.5$ based on traditional criteria and extrapolation of the envelope models.

References

Agresti, A. (1996). An introduction to categorical data analysis. John Wiley & Sons, Inc.
Google Scholar
Agresti, A. (2002). Categorical data analysis. John Wiley & Sons, Inc. http://mathdept.iut.ac.ir/sites/mathdept.iut.ac.ir/files/AGRESTI.PDF
Google Scholar
Box, G. E., Hunter, W. G., & Hunter, J. S. (1978). Statistics for experimenters, an introduction to design, data analysis, and model building. John Wiley & Sons.
Google Scholar
Brown, M. B., & Benedetti, J. K. (1977). Sampling behavior of tests for correlation in two-way contingency tables. Journal of the American Statistical Association, 72(358), 309–20. DOI: https://doi.org/10.1080/01621459.1977.10480995
Web of Science ®Google Scholar
Cardillo, G. (2007a). Myfisher23: A very compact routine for fisher’s exact test on 2x3 matrix (version 2.0, updated 2018-05-04) [The mathworks, inc.]. https://www.mathworks.com/matlabcentral/fileexchange/15399-myfisher23
Google Scholar
Cardillo, G. (2007b). Myfisher24: A very compact routine for fisher’s exact test on 2x4 matrix (version 2.0, updated 2018-04-06) [The mathworks, inc.]. https://www.mathworks.com/matlabcentral/fileexchange/19842-myfisher24
Google Scholar
Cardillo, G. (2007c). Myfisher33: A very compact routine for fisher’s exact test on 3x3 matrix (version 2.0, updated 2018-04-06) [The mathworks, inc.]. https://www.mathworks.com/matlabcentral/fileexchange/15482-myfisher33
Google Scholar
Cochran, W. G. (1952). The χ2 test of goodness of fit. The Annals of Mathematical Statistics, 23(3), 315–345. https://doi.org/https://doi.org/10.1214/aoms/1177729380
Google Scholar
Cochran, W. G. (1954). Some methods for strengthening the common χ2 tests. Biometrics, 10(4), 417–451. https://doi.org/https://doi.org/10.2307/3001616
Web of Science ®Google Scholar
Craddock, J. M., & Flood, C. R. (1970). The distribution of the χ2 statistic in small contingency tables. Journal of the Royal Statistical Society. Series C, Applied Statistics, 19(2), 173–181. https://rss.onlinelibrary.wiley.com/doi/abs/ https://doi.org/10.2307/2346547
Google Scholar
Cramer, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton.
Google Scholar
Diaconis, P., & Gangolli, A. (1994). Rectangular arrays with fixed margins (pp. 452). Department of Statistics, Stanford University. https://statistics.stanford.edu/sites/g/files/sbiybj6031/f/EFS%20NSF%20452.pdf
Google Scholar
Ehwerhemuepha, L., Sok, H., & Rakovski, C. (2019). A more powerful unconditional exact test of homogeneity for 2 × c contingency table analysis. Journal of Applied Statistics, 46(14), 2572–2582. https://doi.org/https://doi.org/10.1080/02664763.2019.1601689
Web of Science ®Google Scholar
Ekstrøm CT. (2020). MESS: Miscellaneous esoteric statistical scripts. R package version 0.5.7. https://CRAN.R-project.org/package=MESS
Google Scholar
Electronics Reliability Subcommittee. (1987). Automotive electronics reliability handbook. Society of Automotive Engineers, Inc.
Google Scholar
Fisher, R. A. (1934). Statistical methods for research workers (5th ed.). Oliver and Boyd.
Google Scholar
Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). John Wiley & Sons.
Google Scholar
Freeman, G., & Halton, J. (1951). Note on an exact treatment of contingency, goodness of fit and other problems of significance. Biometrika, 38(1–2), 141–149. https://doi.org/https://doi.org/10.1093/biomet/38.1-2.141
PubMed Web of Science ®Google Scholar
Gail, M., & Mantel, N. (1977). Counting the number of contingency tables with fixed margins. Journal of the American Statistical Association, 72(360a), 859–862. DOI: https://doi.org/10.1080/01621459.1977.10479971
Google Scholar
Göktaş, A., & İşçi, O. (2011). A comparison of the most commonly used measures of association for doubly ordered square contingency tables via simulation. Metodološki Zvezki, 8(1), 17–37. http://mrvar.fdv.uni-lj.si/pub/mz/mz8.1/goktas.pdf
Google Scholar
Good, I. J., Gover, T. N., and Michell, G. (1970). Exact distributions for X2 and the likelihood ratio statistic for the equiprobable multinomial distribution. Journal of the American Statistical Association, 65(1), 267.
Google Scholar
Goodman, L. A., & Kruskal, W. H. (1979). Measures of association for cross classifications. Springer-Verlag.
Google Scholar
Grinsted, A. (2008). quantreg: Quantile regression with bootstrapping confidence intervals (version 1.3.0.0) [The mathworks, inc.]. https://www.mathworks.com/matlabcentral/fileexchange/32115-quantreg-x-y-tau-order-nboot
Google Scholar
Kahane, C. (2012). Relationships between fatality risk, mass, and footprint in model year 2000–2007 passenger cars and LTVs - final report. National Highway Traffic Safety Administration. DOT HS 811 665 https://www.regulations.gov/document/NHTSA-2010-0152-0040
Google Scholar
Kendal, M. G. (1952). The Advanced Theory of Statistics (5th Ed. Vol. 1). Griffin, London
Google Scholar
Koehler, K. J., & Larntz, K. (1980). An empirical investigation of goodness-of-fit statistics for sparse multinomials. Journal of the American Statistical Association, 75(370), 336–344. https://doi.org/https://doi.org/10.1080/01621459.1980.10477473
Web of Science ®Google Scholar
Kroese, D. P., & Chan, J. C. C. (2014). Statistical modeling and computation. Springer.
Google Scholar
MathWorks. (2016). Statistics and machine learning toolboxTM user’s guide.
Google Scholar
Mehta, C. R., & Patel, N. R. (1986). Algorithm 643: Fexact: A fortran subroutine for fisher’s exact test on unordered r × c contingency tables. ACM Transactions on Mathematical Software (TOMS), 12(2), 154–161. https://doi.org/https://doi.org/10.1145/6497.214326
Web of Science ®Google Scholar
Norušis, M. J. (2008). SPSS 16.0 statistical procedures companion. Prentice Hall, Inc.
Google Scholar
Ott, R. L., Larson, R., Rexroat, C., and Mendenhall, W. (1992). Statistics, a tool for the social sciences. Duxbury Press.
Google Scholar
R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Google Scholar
Rosenthal, I. (1966). Distribution of the sample version of the measure of association, gamma. Journal of the American Statistical Association, 61(314):440–453. https://doi.org/https://doi.org/10.1080/01621459.1966.10480880
Google Scholar
SAS. (2013). Base SAS® 9.4 procedures guide: Statistical procedures, second edition. SAS Institute Inc. http://support.sas.com/documentation/cdl/en/procstat/66703/PDF/default/procstat.pdf
Google Scholar
SPSS. (2009). PASW® statistics 18 command syntax reference.
Google Scholar
SPSS. (2010). IBM SPSS statistics 19 algorithms. Available from: http://www.sussex.ac.uk/its/pdfs/SPSS_Algorithms_19.pdf
Google Scholar
Tate, M. W., & Hyer, L. A. (1973). Inaccuracy of the x2 test of goodness of fit when expected frequencies are small. Journal of the American Statistical Association, 68(344), 836–841. DOI: https://doi.org/10.1080/01621459.1973.10481433
Web of Science ®Google Scholar
Wasserstein, R. L., & Lazar, N. A. (2016). The asa’s statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 440–453. https://doi.org/https://doi.org/10.1080/00031305.2016.1154108
Web of Science ®Google Scholar

Type I error convergence of three hypothesis tests for small RxC contingency tables

ABSTRACT

Public Interest Statement

1. Introduction

1.1. Background

1.1.1. R ✕ C contingency tables

1.1.2. Problem statement

Table 1. Example 2×3 contingency table

Table 2. Hypothesis test results for Table 1

Table 3. Example Pearson Chi-Square test recommendations summarized by Tate and Hyer (Tate and Hyer, Citation1973, p. 836)

1.2. Paper organization

2. Methods

2.1. Monte Carlo simulations

Table 4. R×C contingency table simulation domain

2.1.1. Hypothesis test P-values

2.2. Candidate accuracy assessment envelope models

2.2.1. Candidate quantile models

2.2.2. Candidate accuracy assessment indices

2.2.2.1. Candidate discreteness indices

2.2.2.2. Candidate approximation indices

2.2.2.3. Candidate sample size indices

2.2.2.4. Candidate P-value range indices

2.2.3. Quantile regression method

3. Results

3.1. Fifth percentile Fisher-Freeman-Halton exact test P-values

3.1.1. Single Index Trends

3.1.2. Envelope model search results

Table 5. Fifth percentile Fisher-Freeman-Halton p-value envelope model fit to |Nrc|

3.2. Fifth percentile Pearson Chi-square test P-values

3.2.1. Single index trends

3.2.2. Envelope model search results

Table 6. Fifth percentile Pearson p-value envelope model fit to |Nrc| and (∑minνnˆ)

3.3. Fifth percentile Goodman-Kruskal gamma test P-values

3.3.1. Single Index trends

3.3.2. Envelope model search results

Table 7. Fifth percentile Goodman-Kruskal (ASE0) p-value envelope model fit to |Orc|, (ν∑minνnˆ), and (n../RC)

4. Limitations

5. Recommendations

5.1. Fisher-Freeman-Halton exact test accuracy criteria

5.2. Pearson Chi-Square test accuracy criteria

5.3. Goodman–Kruskal Gamma test accuracy criteria

6. Conclusions

Supplemental Material

Disclosure statement

Supplementary Material

Additional information

Funding

Notes on contributors

R. M. Van Auken

Notes

References

Related research

To cite this article:

Download citation

Your download is now in progress and you may close this window

Login or register to access this feature

Information for

Open access

Opportunities

Help and information

Keep up to date

Table 1. Example $2 \times 3$ contingency table

Table 2. Hypothesis test results for

Table 4. $R \times C$ contingency table simulation domain

Table 5. Fifth percentile Fisher-Freeman-Halton p-value envelope model fit to $| N_{r c} |$

Table 6. Fifth percentile Pearson p-value envelope model fit to $| N_{r c} |$ and $(\sum_{m i n}^{ν} \hat{n})$

Table 7. Fifth percentile Goodman-Kruskal ( $A S E_{0}$ ) p-value envelope model fit to $| O_{r c} |$ , $(ν \sum_{m i n}^{ν} \hat{n})$ , and $(n_{. .} / R C)$