1,691
Views
0
CrossRef citations to date
0
Altmetric
STATISTICS

Type I error convergence of three hypothesis tests for small RxC contingency tables

ORCID Icon & | (Reviewing editor:)
Article: 1934959 | Received 10 Dec 2020, Accepted 23 May 2021, Published online: 01 Sep 2021

ABSTRACT

Many statistical software packages provide hypothesis tests for the independence or association between the rows and columns of a R×C contingency table. However, insufficient guidance is available about the accuracy and domains-of-validity of some tests that are based on assumptions or approximations. This paper assesses the accuracy of the p-values for three tests of small R×C tables using Monte Carlo simulations and quantile regression. Results for the Fisher-Freeman-Halton exact tests indicate the p-value accuracy depends on the number of possible unique nominal probabilities. Results for the Pearson chi-square test indicate the p-value accuracy depends on the number of possible unique nominal probabilities and the expected cell counts. Results for the Goodman-Kruskal gamma test indicate the p-value accuracy depends on the number of possible unique ordinal probabilities, the expected cell counts, and the total cell counts. Empirical models for the accuracy of these tests and recommended domain-of-validity criteria are provided.

Public Interest Statement

Many scientific, technical, and socio-economic fields use contingency tables to evaluate the observed relationships between two or more factors. Statistical tests are widely used to determine whether or not the factors are independent (i.e. no relationship), and whether or not any association between the factors is statistically significant. The results of these tests may affect public policy and legal decisions (e.g. energy, environmental, health, and safety). However, these tests may be based on assumptions and approximations that can affect the accuracy of the test results. Existing guidelines for the use of these tests are often imprecise, inconsistent, subjective, or non-existent. This paper attempts to objectively estimate the accuracy of three exemplar tests and proposes test domain-of-validity criteria based on the estimated accuracy of the tests, which can assist decision-making. The methods developed in this paper can also be applied to other statistical tests and critical test threshold levels.

1. Introduction

Many statistical software packages provide hypothesis tests for the independence (e.g. Pearson chi-square) or association (e.g. Goodman-Kruskal gamma) between the rows and columns of a R×C contingency table. These tests are often based on assumptions or approximations that may limit their domain-of-validity or accuracy. However, it is unclear from the existing literature and statistical software packages what these limitations are.

This paper assesses the domains-of-validity and accuracy of the Fisher-Freeman-Halton exact (Fisher, Citation1934; Freeman & Halton, Citation1951; Mehta & Patel, Citation1986), the Pearson chi-square (Fleiss, Citation1981), and the Goodman and Kruskal gamma (Brown & Benedetti, Citation1977; Goodman & Kruskal, Citation1979) test p-values for contingency tables up to 4×4 and 2×5 versus indices based on the table dimensions and the row and column margin totals. These three tests were evaluated because they are well known and implemented in several statistical software packages, and the results of which are commonly reported and interpreted by a wide audience. The Fisher–Freeman–Halton test was also included because it is based on the exact reference distribution instead of an approximation.

1.1. Background

1.1.1. R ✕ C contingency tables

Two-dimensional R×C contingency tables list the frequencies of two discrete data variable value combinations. Given row and column variables with R and C discrete values, let nij represent the number of times the row variable is equal to the i’th discrete value and column variable is equal to the j’th discrete value. The result is a table with R rows and C columns. The i’th row and j’th column margin totals are denoted by ri=Δni. and cj=Δn.j respectively, where a dot in the subscript denotes summation. The overall total is denoted by n... The expected cell count is nˆij=ricj/n...

There are several different tests for the statistical significance of a R×C table. The appropriateness and accuracy of these tests depends on the type of data, sampling method, sample size, and the null hypothesis (H0) to be tested. The null hypothesis for the Fisher-Freeman-Halton exact and Pearson chi-square tests is the row and column variables are independent or have equal underlying probabilities. The null hypothesis for the Goodman and Kruskal’s gamma test is no association between the row and column variables. Many of these tests are supported by statistical software packages, such as SAS (Citation2013), SPSS (Citation2009), R (R Core Team, Citation2017; Ekstrøm, Citation2020), and MATLAB (MathWorks, Citation2016; Cardillo, Citation2007a).

1.1.2. Problem statement

We have observed that the Goodman-Kruskal gamma test results for association for some contingency tables appear to be inconsistent with the Pearson chi-square or exact test results for independence (or equal underlying probabilities). For example, consider the 2×3 contingency table in . The results of three hypothesis tests for this 2×3 table are listed in . The p-values for the Fisher-Freeman-Halton exact and Pearson chi-square tests are much greater than 0.05, but the p-value for the Goodman-Kruskal gamma test is much less than 0.05.Footnote1 Therefore, these results indicate that there is a statistically significant association between the treatment and outcome even though the null hypothesis for independence or equal underlying probabilities cannot be rejected.Footnote2

Table 1. Example 2×3 contingency table

Table 2. Hypothesis test results for

Existing test guidelines were not sufficient in understanding these apparent inconsistencies (e.g. ). Some recommended criteria for the Pearson chi-square test are that all nˆij should be greater than 5, 10, or even 20. Cochran [Cochran, Citation1952, p. 328], however, commented that the “numbers 10 and 5 appear to have been arbitrarily chosen”. The SPSS CROSSTAB Pearson chi-square test results for the Example in included a footnote “3 cells (50.0%) have expected count less than 5. The minimum expected count is 1.32”. The R chisq.text function presents a warning message for the same table that the “Chi-squared approximation may be incorrect”. However, Cochran indicates in [Cochran, Citation1954, p. 418] that there are a number of special cases where the χ2 approximation may be adequate if some nˆij<5. There is a large body of work that discusses guidelines for using the Pearson chi-square test. Tate and Hyer cited a wide range of opinions in [Tate and Hyer, Citation1973, p. 836] regarding these guidelines, which are summarized . Koehler and Larntz observed that “for the null hypothesis of symmetry, the chi-squared approximation for the Pearson statistic is quite adequate at the .05 and .01 nominal levels for expected frequencies as low as .25 when k3, n10, n2/k10” [Koehler and Larntz, Citation1980, p. 343].Footnote3

Table 3. Example Pearson Chi-Square test recommendations summarized by Tate and Hyer (Tate and Hyer, Citation1973, p. 836)

The guidelines for the Goodman-Kruskal gamma test are almost non-existent. Ott et al. indicates, “Although no specific requirement for sample size is generally given, we should never use this test when n[..]<10” [Ott et al., Citation1992, p. 399]. The accuracy of some of these tests for special cases were previously investigated using Monte Carlo methods in (Craddock & Flood, Citation1970; Göktaş & İşçi, Citation2011), however clear recommendations regarding the domain-of-validity of the tests were not given.

Therefore, the objective of this study was to assess the domain-of-validity and accuracy of the Fisher–Freeman–Halton, Pearson chi-square, and Goodman-Kruskal gamma tests in order to determine generally applicable guidelines for their use. It was not our objective to evaluate other tests or adjustments, such as the mid p-value (Agresti, Citation2002) or Ehwerhemuepha (Ehwerhemuepha et al., Citation2019), which may be more suitable for a given statistical problem.

1.2. Paper organization

The remainder of this paper is organized into five sections. The methods used to are described in Section 2. The results are described in Section 3. Limitations of this analysis are described in Section 4. Recommendations and conclusions are in Sections 5 and 6.

This paper is also accompanied by seven appendices and other supplementary material that are available online. A detailed description of the Monte Carlo simulation method and example results are in Appendix A. Supporting algorithms are described in Appendix B. Empirical envelope model search results are summarized in Appendix C. Supporting methods and results are in Appendix D. Example recommended accuracy criteria calculations are in Appendix E. MATLAB run scripts and functions to reproduce the results herein are described in Appendix F. R run scripts and functions to reproduce the example domain-of-validity and accuracy criteria results herein are described in Appendix G.

2. Methods

The evaluation of the hypothesis test domain-of-validity and accuracy was accomplished in two main steps. First, the 5th percentile p-values for each of the hypothesis tests were estimated for all combinations of R×C tables within a simulation domain using Monte Carlo simulations (e.g. [Kroese and Chan, Citation2014, Chapter 7]). This percentile p-value corresponds to the intended critical p-value for the test (e.g. 0.05).Footnote4 Envelope models to quantify the range of observed 5th percentile p-values based on the table dimensions and row and column margin totals were then determined.

2.1. Monte Carlo simulations

Monte Carlo simulations were used to estimate the 5th percentile p-values (p0.05) for contingency tables with independent row and column variables and fixed margin totals (i.e. Method II sampling [Fleiss, Citation1981, p.27]), for each of the hypothesis tests described in Subsection 2.1.1. This involved first generating a set of randomly sampled contingency tables with independent row and column variables and fixed ri and cj margin totals. Then, for each hypothesis test: the p-values for each randomly sampled table were calculated; the empirical distribution of the p-values was determined; and the 5th percentile p-value was calculated; as depicted by the examples in . This process is described in more detail in Appendix A. All of the results in this paper were calculated using MATLAB version 9.0 (R2016a) with the Statistics and Machine Learning Toolbox version 10.2 (R2016a) (MathWorks, Citation2016).

Figure 1. Distributions of observed hypothesis test p-values for a random sample of 2×3 tables with independent row and column variables, r=[39,11] and c=[35,9,6]

Figure 1. Distributions of observed hypothesis test p-values for a random sample of 2×3 tables with independent row and column variables, r=[39,11] and c=[35,9,6]

Example cumulative probability distributions of observed p-values from three hypothesis tests of randomly sampled tables are shown in . The vertical and horizontal scales of the P–P plots in this figure have been transformed by the cube root of the numerical values to place more emphasis on the smaller probability values that are of interest for hypothesis testing. The cumulative probabilities of the observed values were estimated by the mean rank [Electronics Reliability Subcommittee, Citation1987, pp.83–84].

Ideally the distribution of calculated p-values for each hypothesis test would be uniformly distributed between 0 and 1 if the null hypothesis is true. Therefore the α’th percentile p-value for each test would ideally be α/100%, and all of the symbols in would ideally be on the diagonal dotted lines. For example, p0.05 would ideally be 0.05. The deviation in the observed percentile p-values from the ideal values, as depicted by the symbols in and the solid horizontal line, is the main subject of this analysis.

There are at least two sources of deviation depicted in . The first is a staircase effect associated with the discrete number of possible p-values for tables with the same row and column margin totals. This staircase effect is most apparent in Sub for the Fisher-Freeman-Halton exact test, where the sizes of the adjacent steps are about the same, but the overall size trend is increasing, and there are no other sources of error such as asymptotic approximations. The sum of the horizontal step lengths is 1. Therefore, the horizontal length of each step, which represents the minimum and maximum deviation in the observed p-values from the ideal values, tends to be inversely proportional to the number of possible unique p-values (e.g. |Nrc|). The steps for the Fisher-Freeman-Halton test are also to the right of the diagonal line, indicating the observed p-values tend to be equal or greater than the ideal p-value, and therefore the test tends to be conservative.

The second source of deviation from the ideal p-value, which is present in Sub , but not Sub , is due to the inaccuracy of the assumed reference distribution or asymptotic approximation. This effect is most apparent in Sub , where the calculated p-values between 0.004 and 0.07 tend to be less than the ideal values, based on the assumption that zi=G/ASEi is normally distributed (see Appendix D.3), and therefore tends to be liberal. Furthermore the extreme tails of the calculated p-value distribution are also truncated in the other direction (i.e. tends to be conservative) because 1G1. This later condition occurred for one of the observed values in Sub where G=1, ASE0=0.2785, and the resulting p-value is 0.000330 (see the first row in Table A1). The example results in Sub suggest that the discrete staircase and asymptotic approximation effects on the Pearson chi-square test p-values have similar magnitudes, and that the overall trend tends to be unbiased.

P-value distributions were calculated for all combinations of the row and column margin totals (i.e. ri and cj) within the simulation domain summarized in . This comprised all R×C tables with n.. between the minimum and maximum values listed in columns 3 and 4, inclusive. The R×C table dimensions comprised 2RC4 plus 2×5 tables. Tables with R>C dimensions were not included because the p-values for the hypothesis tests considered herein are same as the transposed table. The 2×5 tables were included to better elucidate potential independent effects of table dimensions and degrees-of-freedom.

Table 4. R×C contingency table simulation domain

The simulation method and example results are described in more detail in Appendix A.

2.1.1. Hypothesis test P-values

P-values for each of the following hypothesis tests were calculated for each unique table:

Fisher and Fisher-Freeman-Halton exact tests. P-values for 2×2 tables were calculated using the fishertest function in the MATLAB statistics toolbox. P-values for 2×3, 2×4, and 3×3 tables were calculated using Cardillo’s myfisher23, myfisher24, and myfisher33 functions (Cardillo, Citation2007a, Citation2007b and Citation2007c).

Pearson chi-square test. This test assumes that X2=i=1Rj=1Cnijnˆij2/nˆij has a χ2 distribution with ν=(R1)(C1) degrees-of-freedom if H0 is true. The p-value is the probability of the observed or larger X2 value if H0 is true, which was calculated using the chi2cdf function in the MATLAB statistics toolbox.

This assumption is one of convenience because the “X2 statistic has approximately a chi-square distribution for large samples sizes. It is difficult to specify what ‘large’ means, but [nˆij]5 is sufficient” [Agresti, Citation1996, p. 28]. The accuracy of the p-value resulting from this assumption and approximation is one of the subjects of this study.

Goodman-Kruskal’s gamma test. There is more than one version of the Goodman-Kruskal gamma test. The original version proposed by Goodman and Kruskal is described in Appendix D.3.2. A modified version based on (Brown & Benedetti, Citation1977), which is implemented in SAS [SAS, Citation2013, p. 160] and SPSS [SPSS, Citation2010, p. 161], is described in Appendix D.3.1. This test assumes that zi=G/ASEi has a normal distribution with zero mean and unit standard deviation if H0, where G and ASEi are calculated according to the equations in Appendix D.3. The subscript i denotes the modifed (0) or original (3) version of the test. The p-value is the probability of the observed or larger |zi| value if H0 is true, which was calculated using the normcdf function in the MATLAB statistics toolbox.

This assumption is also one of convenience because according to (Goodman & Kruskal, Citation1979) (e.g. pages 78 and 93) and (Brown & Benedetti, Citation1977), z0 actually has an approximate, asymptotically unit-normal distribution for large n... The zi value is approximately normally distributed because G is limited to 1G1. The accuracy of the p-value resulting from this assumption and approximation is also one of the subjects of this study.

2.2. Candidate accuracy assessment envelope models

It is assumed herein that the accuracy of the hypothesis test p-values can vary depending on the contingency table dimensions and the row and column margin totals. Candidate envelope models and indices to quantify this relationship are described is this section.

2.2.1. Candidate quantile models

The accuracy of the hypothesis tests is described herein by the 2.5% and 97.5% quantiles of the observed p0.05 values. These quantiles represent the lower bounds (LB) and upper bounds (UB) of a 95% interval enclosed by the envelope model. Note the 0.0% and 100% quantiles were not used because these quantiles are numerically ill-defined and non-unique. For example, one numerical solution for the 100% quantile of observed p0.05 values is simply 1, which is not very descriptive of the upper bounds. It was also generally assumed that the LB and UB will tend to (e.g. asymptotically) converge to the ideal value (pideal) (e.g. 0.05) as the candidate indices tend towards a limiting value. This limiting value can be either infinity or zero, depending on the type of index. Therefore, the following empirical model forms were considered for the LB, UB, and median quantiles:

(1) Q=πgpideal+ijβi,jxiqi,j+\isin(1)
(2) Q=πgpideal+sexpλ0+iλilogxi+\isin(2)

where π(a)=Δexpa/1+expa, g(b)=Δlnb/1b, and xi are one or more candidate indices. The qi,j in EquationEquation (1) are assumed polynomial powers for xi. The βi,j and are unknown coefficients to be estimated by quantile regression (e.g. (Grinsted, Citation2008)). The s coefficient in EquationEquation (2) is either +1 or −1, and λi and are unknown coefficients to be estimated by quantile regression. The form of these equations guarantee that 0<Q<1 for xi>0 and Qpideal if all xi and qi,j<0. These two models are equivalent if β1,1=exp(λ0), q1,1=λ1, and all other βi,j=0 and λi=0.

The polynomial model form in EquationEquation (1) is well suited for fitting one or more indices simultaneously for assumed qi values. The assumed values for qi included [Citation1], [−0.5], [−1], [−1,-2], and [−1,-2,-3] depending on the test and candidate index. Higher degree polynomials can fit the data better than lower degree polynomials, but can result in overparameterized models, which is undesirable.

The logarithmic model form in EquationEquation (2) is only applicable to a single index but can be used to estimate a scalar value for qi. These models tend to be more parsimonious, which is desirable.

2.2.2. Candidate accuracy assessment indices

Four types of candidate indices based on the table dimensions and margin totals were considered to assess different potential effects on p-value accuracy. The first three types of indices have positive values that tend to increase for tables with larger margin totals or overall counts. The last type of index has a value between 0 and 1.

2.2.2.1. Candidate discreteness indices

The following indices were considered to represent effects of the discrete p-value distribution resulting from the whole number table counts (e.g. as depicted by the staircase shaped curve in Sub ):

Src The number of unique R×C tables with ri and cj marginal totals (|S| in this context denotes the cardinality of set S, and Src is the set of all unique R× C tables as defined in Appendix b.1) The rationale for this candidate index is that the distribution of sampled p-values depends on the maximum possible number of unique sampled p-values, which is less than or equal to |Src|. An algorithm to calculate |Src| given R, C, and the margin totals is in Appendix B.1. Other algorithms are described in (Diaconis & Gangolli, Citation1994; Gail & Mantel, Citation1977).

|Nrc| The number of unique nominal probabilities for R×C tables with ri and cj marginal totals. The rationale for this index is that the distribution of sampled nominal (e.g. the Fisher-Freeman-Halton exact test) p-values depends on |Nrc|. An algorithm to calculate |Nrc| given R, C, and the margin totals is in Appendix B.2. Note that |Nrc||Src|.

|Orc| The number of unique ordinal probabilities for R×C tables with ri and cj marginal totals. The rationale for this index is that the distribution of sampled ordinal (e.g. the Goodman-Kruskal gamma test) p-values depends on |Orc|. An algorithm to calculate |Orc| is in Appendix B.3. Note that |Orc||Src|.

2.2.2.2. Candidate approximation indices

The following indices were also considered for the Pearson chi-square and Goodman and Kruskal gamma tests to represent approximation accuracy and asymptotic standard error effects:

• The minimum expected cell count, which is nˆmin=Δmini=1Rminj=1Cnˆij. The rationale for this candidate index is that Fleiss [Fleiss, Citation1981, p. 25] and others have warned that the Pearson chi-square test may not be accurate if nˆmin is small, (e.g. nˆmin< 5, 10, or 20).

• The sum and the average of the ν smallest expected cell counts, which are minνnˆ=Δk=minνnˆk and minνnˆ/ν, where nˆk is the k’th smallest expected cell count. The rationale for these indices is that they can account for multiple small expected cell counts in higher dimension tables. These indices are equal to nˆmin for 2×2 tables.

• The harmonic mean and the root harmonic mean of the expected cell counts, which are nˆhm=ΔRC/i=1Rj=1Cnˆij1, and nˆrhms=ΔRC/i=1Rj=1Cnˆij2. The rationale for these indices is that they can also account for multiple small expected cell counts.

Scaled and offset versions of the candidate expected cell count indices (xi) were also considered (i.e. Axi+B). The offset B=1 was considered to avoid small numerical values which, when inverted by a negative qi,j power in EquationEquation (1), become very large and can adversely affect the overall model fit. Scaling factors A=ν and A=RC were considered to address potential effects of different contingency table dimensions.

The scaling factor A=1/f(X0.052,ν) was also considered to potentially normalize the effects of different table dimensions on the Pearson chi-square p-values. This scaling factor and values are described in Appendix D.1.

2.2.2.3. Candidate sample size indices

The total table count n.. was also considered for the Pearson chi-square and Goodman and Kruskal gamma test envelope models. The rationale is that n.. appears in the gamma test based on ASE0 but not the original test based on ASE3 (see Equations D.3 and D.8). Therefore, the difference between the two gamma test p-values goes to zero if n..0. The scaled indices νn.., n../ν, RCn.., and n../RC were also considered to address potential effects of different contingency table dimensions. Koehler and Larntz (Koehler & Larntz, Citation1980) also investigated the effects of sample size (n..) and the number of cells (RC) on the accuracy of the Pearson chi-square approximation.

2.2.2.4. Candidate P-value range indices

The minimum and maximum possible p-values for a given test and row and column margin totals may also be related to the domain-of-validity and the accuracy of the hypothesis test p-values. These indices are denoted by minptest and maxptest. They can be calculated by the algorithms in Appendix B to calculate |Nrc| and |Orc|.

In general minptest tends to approach 0 and maxptest tends to approach 1, for tables with large numerical values. There are two special cases for the Fisher-Freeman-Halton exact test:

maxpFisher=1 because the p-value for a given table is defined as the sum of the probabilities for all of the tables with the same ri and cj that are as extreme or more extreme as the given table [Fleiss, Citation1981, p. 25] and [MathWorks, Citation2016, p. 25–1564]. There will always be at least one extreme table for which this sum includes all of the possible tables.

minpFisher=1 if |Nrc|=1 because the sum of the probabilities for the only unique probability is 1.0.Footnote5

This leads to the following proposed p-value domain-of-validity criterion:

If the critical p-value (e.g. 0.05) is less than minptest or greater than maxptest then the row and column margin totals are outside the test p-value domain-of-validity for the critical p-value.

This criterion would be in addition to other possible test accuracy criteria to be identified.

The maxptest indices were not considered for the candidate envelope models.

2.2.3. Quantile regression method

Quantile regressions were accomplished using the quantreg3.m function (see the Supplementary Online Material). This function finds the value for b that minimizes fval, defined as

(3) fval=kNy˜kτ1ify˜k<0τotherwise,(3)

where y˜=FXby, y is a N×1 matrix of N observed dependent variable values (e.g. p0.05), X is a corresponding N×p matrix of p corresponding independent variable values (e.g. xij), b is an estimate of β, and τ is the desired quantile (e.g. 0.025 or 0.975), and F is an output transformation (e.g. F=π+g(pideal)). The quantreg3 function is based on Grinsted’s quantreg.m function (Grinsted, Citation2008) with the following extensions:

(1) optional input to provide an initial guess for the estimated β values (b0), otherwise it is assumed that b0=X1y.

(2) optional input to provide an output transformation function F, otherwise it is assumed that F=.

(3) optional output to return the fval value calculated by the fminsearch.m function.

The average difference between the observed and modeled quantile (fˉ) was then calculated by dividing fval by the number of observations in the regression for candidate model comparison purposes.

The initial guesses for the LB and UB models were

(4) b0=XIp×p1y0p×1,(4)

where yk=gp0.05,kgpideal, Ip×p is an p×p identity matrix, and 0p×1 is a p×1 matrix of zeroes. The p additional rows in EquationEquation (4) helps to prevent large b0 values if X is ill-conditioned. The initial guess for the median model was the average of the estimated β values for the LB and UB models.

In some cases, the quantile regression did not converge within the default number of iterations, possibly due to overparameterization. If this occurred then submodels with one or more dropped terms were evaluated, and the submodel yielding in the minimum fmin value was selected. In this case the estimated β values for the missing terms were set to 0. As a final step, the quantile regression with all of the candidate terms was reattempted using the best estimated submodel as an initial guess. This was accomplished by the quantreg_s3.m function, which calls the quantreg3.m function (which was in turn based on Grinsted’s quantreg.m function (Grinsted, Citation2008)).

The 95% confidence intervals for the estimated model coefficients were estimated using the jackknife method described by Kahane [Kahane, Citation2012, pp. 64–66]. This involved classifying the p0.05 observations into (e.g. 14) groups (k) based on the R×C table dimensions and whether or not n.. was even or odd. The quantile regression model coefficients were then re-estimated by removing the data for one group k at a time (i.e. bk). Pseudo estimates of the model coefficients (bk) were then determined based on bkb. The 95% confidence intervals were then estimated based on variation in bk. The rationale for the jackknife grouping is to address potential table dimension sampling effects, and potential effects of “even-size samples” mentioned in [Goodman and Kruskal, Citation1979, p. 363].

3. Results

Single index trends and envelope model search results for the 5th percentile Fisher-Freeman-Halton exact, Pearson chi-square, and Goodman-Kruskal gamma test p-values are presented in separate subsections.

3.1. Fifth percentile Fisher-Freeman-Halton exact test P-values

3.1.1. Single Index Trends

The results in compare the 5th percentile p-values from the Monte Carlo simulations of the Fisher and Fisher-Freeman-Halton exact tests to |Src|, |Nrc|, and minpFisher respectively. The results for four different R×C table dimensions are depicted by different symbol types as indicated in the legend. The number of observed p0.05 values for each table dimension are also indicated in the legend. The total number of p0.05 observations in each figure is 14,154,495 .Footnote6 Each of the p0.05 observations is based on the p-values from 100,000 randomly sampled tables. Therefore, the results in these figures are based on 1,415,449,500,000 randomly sampled tables. The horizontal dotted lines indicate 1/2, 1, and 2 times the ideal 5th percentile p-value value (0.05) for reference purposes. Sub also includes a reference curve for 0.05+1/|Nrc|, which represents a theoretical maximum deviation from the ideal p-value assuming that the |Nrc| possible unique p-values are equally spaced.

Figure 2. Fifth percentile Fisher-Freeman-Halton p-values versus |Src|, |Nrc|, and minpFisher

Figure 2. Fifth percentile Fisher-Freeman-Halton p-values versus |Src|, |Nrc|, and minpFisher

This figure shows that all of the 5th percentile p-values are greater than or equal to 0.05. This indicates that the Fisher-Freeman-Halton exact test tends to be conservative, a result which has been observed by others (e.g. [Agresti, Citation2002, pp. 93–94]).

The vertical dotted lines indicate the index values for all 2×3 tables with row margins = [39,11] and column margins = [35,9,6] (e.g. ). These subfigures indicate that p0.05 tends to be between 0.05 and 0.10 for these index values, which is desirable from a test accuracy viewpoint.Footnote7

also shows estimated curves for the candidate UB, median, and LB models based on EquationEquation (2). The equations for the curves are also indicated in the legend. The models were fit to 13,696,372 observed p0.05 values for R×C tables with minpFisher0.05. The other 458,123 observations with minpFisher>0.05 were excluded because they were outside the 0.05 p-value domain-of-validity for the Fisher-Freeman-Halton exact test. The excluded observations included 475 p0.05 values for R×C tables with |Nrc|=1.

These subfigures indicate the observed p0.05 values tend to converge towards the ideal 5th percentile p-value as |Src| or |Nrc| increases. This result was anticipated based on the example results in Sub , which indicated that the deviation in the observed p-values from the ideal p-values tends to be inversely proportional to the number of possible unique test p-values. The estimated UB curve in Sub tends to follow the theoretical 0.05+1/|Nrc| curve, allowing for differences due to the assumed logistic form of the UB model and variation in unique probability spacing. The range of data and the estimated LB curve indicates that the lower bounds for p0.05 are approximately equal to the ideal 5th percentile p-value, which is expected for the exact tests. This result was also anticipated based on Sub , which showed that the observed p-values were greater than or equal to the ideal p-values.

Sub is similar to Sub with the following exceptions. The candidate index is minpFisher. The horizontal axis scale has been transformed to correspond to the vertical axis because they are both probabilities. Sub shows a large number of observations with minpFisher>0.05, and therefore outside the test 0.05 p-value domain-of-validity. The observations outside the 0.05 p-value domain-of-validity were not used to fit the candidate envelope models.

Results for the candidate single index models are summarized in Table C.1 in Appendix C. The results are rank ordered from the smallest (most desirable) combined fˉ value to the largest. These results indicate that the single index model based on |Nrc| has the best fit, with λˆ1= −0.799 ± 0.357 for the UB curve.

3.1.2. Envelope model search results

Envelope models based on EquationEquation (1) with one or two indices were also investigated. The candidate q1 for the candidate |Src| and |Nrc| indices included [−0.75] and [−0.8] based on the estimated λˆ1 results in Table C.1, plus [−0.5], [−1], [−1,-2], and [−1,-2,-3]. It was assumed that q2=[1] for the candidate minpFisher index because the model envelope was assumed to converge to the ideal value if minpFisher0. The candidate model indices and the resulting fˉ figure-of-merit values are summarized in Table C.2. The tabulated results are rank ordered from the smallest (most desirable) combined fˉ value to the largest.

These results, like Table C.1, indicate that envelope models based on |Nrc| tend to fit the range of observed 5th percentile Fisher-Freeman-Halton p-values better than corresponding models based on |Src|. This was expected because the magnitude of the discrete staircase effect indicated in is more closely related to the number of unique nominal probabilities than the number of unique tables.

The results in Table C.2 indicate that the candidate envelope model based on x1=|Nrc| with q1=[1,2,3] and minpFisher has the best overall fit to the observed p0.05 values with a combined fˉ= 0.000712 figure-of-merit. However, only one of the four estimated UB model coefficients was statistically significant at the 0.05 level. Therefore, the model with four coefficients may be overparameterized. We would expect the coefficients for the LB model to be small and statistically significant because the lower bounds for the exact test should be approximately equal to the expected p-value.

The candidate envelope model with the smallest combined fˉ and all statistically significant UB model coefficients is based on x1=|Nrc| with q1=[0.8] and x2=minpFisher. The estimated coefficients for this model are listed in Table D.2. The combined fˉ figure-of-merit for this model is 0.000745, a 5% increase compared to the best overall fit in Table C.2. However, the estimated coefficient for minpFisher in the UB model is negative. This result indicates that the upper bounds for p0.05 tends to increase as minpFisher0, which is counterintuitive. These results appear to be associated with R×C tables that have small minpFisher values and large 5th percentile p-values. This can occur, for a given set of row and column margin totals, if there is a R×C table with Fisher-Freeman-Halton p-value near the critical p-value and the next larger p-value is much larger than the critical p-value, typically due to a tie or near tie. For example, given the set of 2×3 tables with r=[11,64] and c=[2,8,65], minpFisher=1.3E11 and the 5th percentile p-value is 0.164648 (see Figure D.1). This is because the p-value for the table with [1,3,7; 1,5,58] cell counts is 0.049238, which is slightly less than 0.05, and the next largest p-values are 0.164648 for tables with [1,2,8; 1,6,57] and [0,3,8; 2,5,57] cell counts.

If the minpFisher term is dropped, then the envelope model with the next smallest combined fˉ is based on x1=|Nrc| with q1=[0.8]. The estimated coefficients for this model are listed in , and the estimated UB coefficient for the |Nrc| term is statistically significant. The combined fˉ figure-of-merit for this envelope model is 0.000753, which is a 1% increase compared to the aforementioned model with the minpFisher term. The |Nrc| term for the LB model is small and not statistically significant, as would be expected for an exact test.

Table 5. Fifth percentile Fisher-Freeman-Halton p-value envelope model fit to |Nrc|

According to the estimated model coefficients in the 95% interval for p0.05 would exceed 0.10 if |Nrc|<20.955.Footnote8 Therefore, a possible p-value accuracy criterion for the Fisher-Freeman-Halton exact test is |Nrc|20.955.

3.2. Fifth percentile Pearson Chi-square test P-values

3.2.1. Single index trends

The results in compare the 5th percentile p-values from the Monte Carlo simulations of the Pearson chi-square test to |Nrc|, nˆmin, n.., and minpPearson respectively. The format of this figure is similar to . The results for seven different R×C table dimensions are depicted by different symbol types as indicated in the legend. The total number of p0.05 observations in each figure is 24,578,997 . These observations are based on 2,457,899,700,000 randomly sampled tables.

Figure 3. Fifth percentile Pearson p-values versus Nrc, nˆmin, n.., and minPPearson

Figure 3. Fifth percentile Pearson p-values versus Nrc, nˆmin, n.., and minPPearson

The results in Sub indicate that the observed p0.05 values tend to converge towards the ideal 5th percentile p-value as |Nrc| or nˆmin increases, but not when n.. increases. The converging nˆmin results in Sub are somewhat consistent with traditional criteria that the expected cell counts should be greater than some minimum value (e.g. 5, 10, or 20). The results in Sub also show that small |Nrc| values are undesirable, which was also expected based on the results in Sub and similar results in Sub for the Fisher-Freeman-Halton test. The lack of convergence for n.. in Sub may be attributed to small |Nrc| or nˆmin values.

The results in Sub show that the Pearson chi-square test p-values are outside the test p-value domain-of-validity if minpPearson is greater than the ideal value (e.g. 0.05). Therefore 46,578 observations with minpPearson>0.05 were excluded from the envelope model quantile regression analysis.

The results in Sub also indicate an abrupt change near nˆmin=0.05. This change was further explored in Figure D.5, which shows that the differences between the 5th percentile Pearson and Fisher-Freeman-Halton p-values tends to peak near nˆmin=0.5. Therefore, all of the observations with nˆmin0.5 were also excluded from the quantile regression analysis of the Pearson chi-square test p-values because these observations were also expected a priori to be outside the Pearson domain-of-validity and could adversely affect the overall fit to the assumed candidate model form. The estimated models for the Pearson chi-square test p-value envelope are therefore based on a fit to the remaining 10,249,393 observations, as depicted in Figure D.6. The results SubFigure D.6a more clearly shows the convergence to the ideal p-value as |Nrc| increases.

and D.6 also show the estimated curves for the candidate UB, median, and LB models based on EquationEquation (2). The estimated λ1 and fˉ figure-of-merit values for these and other candidate single index models are summarized in Table C.3.

3.2.2. Envelope model search results

The convergence of the observed Pearson chi-square test p0.05 values towards the ideal 5th percentile p-value as |Nrc| or nˆmin increases indicates the need to determine the independent effects of discreteness and approximation on the accuracy of the test. This was accomplished by considering envelope models based on EquationEquation (1) with up to four candidate indices to address different effects. The assumed candidate powers for the candidate indices based on |Src| or |Nrc|, nˆmin, and n.. where q1=[0.5] and q2=q3=[1.0]. The rationale for these negative powers is that the range of observed p0.05 values is assumed to converge to the ideal 5th percentile p-value as these indices increase towards infinity, and the −0.5 and −1.0 values represent the range of λˆ1 results in Table C.3. The assumed candidate power for the candidate minpPearson index was q4=[1].

The candidate model indices and the resulting fˉ figure-of-merit values for the Pearson chi-square test are summarized in Table C.4. The results are rank ordered from the smallest (most desirable) combined fˉ value to the largest. These results indicate that the model with x1=|Nrc|, x2=minνnˆ, x3=n../ν, and x4=minpPearson had the smallest combined fˉ=0.000600. However, the estimated LB and UB model coefficients for the n../ν and minpPearson terms were not statistically significant; the term based on n.. was not consistent in the top ranked models; and the minpPearson term did not improve the fit very much. This suggests that these three-term and four-term models may be overparameterized, which is undesirable. Dropping the terms based on n.. and minpPearson from the model search increases the combined fˉ slightly from 0.000600 to 0.000617. The estimated model coefficients for this top ranked one-term or two-term model are listed in .

Table 6. Fifth percentile Pearson p-value envelope model fit to |Nrc| and (minνnˆ)

Note that the estimated LB model coefficient for the minνnˆ term in has a small positive value that is not statistically significant. A positive value indicates that the lower bounds tends to increase for small minνnˆ values, which is counter intuitive. If a negative sign constraint is imposed on the LB model coefficients then the minνnˆ term drops out, the |Nrc|0.5 coefficient for the LB model changes from 2.1022±0.65233 to 1.9942±0.39061, and fˉ increases slightly from 0.000494 to 0.000495. This result appears to be consistent with the results in Figure D.6, which indicate that |Nrc| has a stronger effect on the convergence then nˆmin for nˆmin>0.5. Dropping the minνnˆ term from the UB model is not recommended because fˉ increases substantially from 0.000741 to 0.001110 as indicated in Table C.4.

Scatter plots of the 5th percentile Pearson p-values versus the UB and LB models based on |Nrc| and minνnˆ are depicted in . The diagonal dotted lines indicate the p-values corresponding to the UB and LB models. Therefore, 2.5% of the observed p0.05 values in Sub should be above the diagonal line, and 2.5% of the p0.05 values in Sub should be below the diagonal line. The horizontal and vertical dotted lines indicate 1/2, 1, and 2 times the ideal 5th percentile p-value for reference purposes.

Figure 4. Fifth percentile Pearson p-values versus functions of Nrc and Σminvnˆ

Figure 4. Fifth percentile Pearson p-values versus functions of Nrc and Σminvnˆ

depicts contours of the estimated lower and upper bound models for the 5th percentile Pearson p-value versus |Nrc| and minνnˆ. The lower bounds is depicted in Sub . The small symbols are a locus of points that satisfy LB=0.025 and the adjacent shaded region depicted by the symbols indicates where LB<0.025. The asymptotic values for the LB=0.025 contour are |Nrc|8.545 when minνnˆ, and minνnˆ undefined when |Nrc| because the model coefficient is non-negative. If the minνnˆ term is removed from this LB model then the LB=0.025 contour is simply |Nrc|= 7.690 . The and Δ symbols indicate the relative distributions of observed p0.05 values 0.025 and <0.025. The upper bounds is depicted in Sub in a similar manner. The small symbols are a locus of points that satisfy UB=0.10 and the shaded region indicates where UB>0.10. The asymptotic values for the UB=0.10 contour are |Nrc|4.620 and minνnˆ1.504 when the other index goes to infinity. These results indicate that p0.05 tends to be between 0.025 and 0.10, which is desirable, if the |Nrc| and minνnˆ coordinate is in the unshaded region of both subfigures. The minνnˆ>1.504 criterion when |Nrc| is consistent with Cochran’s statement that “there is little disturbance to the 5% level when a single expectation is as low as 1/2” [Cochran, Citation1952, p. 329].

Figure 5. Fifth percentile Pearson p-value envelope model contours

Figure 5. Fifth percentile Pearson p-value envelope model contours

The symbol in indicates the coordinates for 2×3 tables with r=[39,11] and c=[35,9,6] (e.g. ). This symbol is located in the unshaded region of both subfigures. This indicates the 5th percentile p-values for randomly sampled tables with these margin totals tend to be between 0.025 and 0.10, which is desirable from a test accuracy viewpoint.

The same LB and UB models can also be applied to more stringent limits such as 0.04 to 0.06. The value for the LB=0.04 contour would be |Nrc|= 72.9 if the minνnˆ term is dropped. The asymptotic values for the UB=0.06 contour would be |Nrc|69.3 and minνnˆ5.83 when the other index goes to infinity. This more stringent UB criteria is within the range of the nˆmin 5 or 10 criterion recommended by others.

3.3. Fifth percentile Goodman-Kruskal gamma test P-values

3.3.1. Single Index trends

The results in compare the 5th percentile p-values from the Monte Carlo simulations of the Goodman–Kruskal Gamma test based on ASE0 to |Orc|, nˆmin, n.., and minpGK0 respectively. The format of this figure is also similar to . The total number of p0.05 observations was 24,578,997.

Figure 6. Fifth percentile Goodman-Kruskal p-values (based on ASE0) versusOrc, nˆmin,n.., minpGK0

Figure 6. Fifth percentile Goodman-Kruskal p-values (based on ASE0) versusOrc, nˆmin,n.., minpGK0

Like the results for the Pearson chi-square test, the results in Sub for the Goodman-Kruskal gamma test indicate that the observed p0.05 values tend to converge towards the ideal 5th percentile p-value as nˆmin increases, but not when n.. increases. The results in Sub indicate that the observed p0.05 values appear to converge to values less than the ideal 5th percentile p-value as |Orc| increases, indicating possible liberal bias.

The results in Sub show that the Goodman-Kruskal gamma test p-values are outside the test p-value domain-of-validity if minpGK0 is greater than the ideal value (e.g. 0.05). Therefore 152,072 observations with minpGK0>0.05 were excluded from the envelope model quantile regression analysis.

In addition, it was observed in Figure D.4 that the differences between the 5th percentile p-values from two different versions of the Goodman-Kruskal gamma test are the largest when nˆmin is about 0.5. This result is similar to, but more pronounced than, the differences between the Fisher-Freeman-Halton and Pearson p0.05 values observed in Figure D.5. Therefore, all of the observations with nˆmin0.5 were excluded from the quantile regression analysis of the Goodman-Kruskal gamma test p-values because they could adversely affect the overall fit to the assumed candidate model form. The estimated models for the Goodman-Kruskal gamma test p-value envelope are therefore based on a fit to the remaining 10,143,899 observations with nˆmin>0.5, as depicted in Figure D.7. The results in SubFigure D.7a also more clearly shows convergence, but not to the ideal p-value, as |Orc| increases.

3.3.2. Envelope model search results

Like the Pearson chi-square test, the convergence of the observed Goodman-Kruskal gamma test p0.05 values as |Orc| or nˆmin increases also indicates the need to elucidate the independent effects of discreteness and approximation on the accuracy of the test. This was also accomplished by considering envelope models based on EquationEquation (1) with up to four candidate indices. The assumed candidate qi were the same as for the Pearson chi-square test envelope models, based on similar rationale.

The candidate model indices and the resulting fˉ figure-of-merit values for the Goodman-Kruskal gamma test are summarized in Table C.6. The results in this table are also rank ordered from the smallest (most desirable) combined fˉ value to the largest. These results indicate that of the candidate model with x1=|Orc|, x2=νminνnˆ, x3=n../RC, and x4=minpGK0 had the smallest combined fˉ=0.000537. However, the top ranked model with only three indices fits almost as well. This model drops the minpGK index and the combined fˉ value increases slightly from 0.000537 to 0.000547 . The estimated model coefficients for this envelope model are listed in ().Footnote9

Table 7. Fifth percentile Goodman-Kruskal (ASE0) p-value envelope model fit to |Orc|, (νminνnˆ), and (n../RC)

Scatter plots of the 5th percentile Goodman-Kruskal p-values (based on ASE0) versus the UB and LB models in are depicted in . The format and interpretation of these plots are similar to

Figure 7. Fifth percentile Goodman-Kruskal p-values (based on ASE0) versus Orc, vΣminvnˆ, and n../RC

Figure 7. Fifth percentile Goodman-Kruskal p-values (based on ASE0) versus Orc, vΣminvnˆ, and n../RC

. Fifth percentile Goodman-Kruskal p-values (based on ASE0) versus functions of |Orc|, νminνnˆ, and n../RC.

Contours of the estimated lower and upper bound models for the 5th percentile Goodman-Kruskal p-value versus |Orc|, νminνnˆ, and n../RC are depicted in Figure D.8 in Appendix D. These plots graphically illustrate that p0.05 tends to be between 0.025 and 0.10, which is desirable, if the |Orc|, νminνnˆ, and n../RC coordinate is in the unshaded region of all of the subfigures. The symbols, which represent the indices for the example in , are outside this desired (i.e. unshaded) region. This could explain the anomalous Goodman-Kruskal gamma test result listed in .

4. Limitations

The results herein are based on Monte Carlo analysis of R×C contingency tables within the simulation domain listed in . It is assumed that the results for tables with R>C are equivalent to results for tables with R<C due to the symmetry of the hypothesis tests. It is also assumed that the R×C tables are randomly sampled with fixed row and column margin totals. The observed 5th percentile p-value for each row and column margin total combination is based a random sample of 100,000 tables. The p-value accuracy results are limited to tables with row and column margin totals within the p-value domain-of-validity. The results for the Fisher-Freeman-Halton exact test envelope models are also limited to tables with |Nrc|2; and the Pearson chi-square test and Goodman-Kruskal gamma test envelope models are limited to tables with nˆmin>0.5. The estimated envelope models are based on the assumption that all of the 5th percentile p-values for each set of fixed row and column margin totals within the simulation domain are equally weighted. The results are based on a critical test p-value of 0.05, but the methodology and results could be extended to other critical p-values.

5. Recommendations

The following criteria to determine whether or not a hypothesis test should be used for a given R×C contingency table are recommended based on the results in the previous section. These criteria comprise two parts and are based on the table dimensions and margin totals.

First, the critical p-value must be within the p-value domain-of-validity of the hypothesis test. This can be evaluated by calculating the minimum possible hypothesis test p-value using the most extreme table with the same row and column margin totals. If the minimum possible p-value is greater than or equal to the critical p-value then the test is not reliable and should not be used. It is assumed that the maximum possible p-value is always greater than the critical p-value and does not need to be evaluated.

If the p-value domain-of-validity criterion is met, then the p-value accuracy criteria is evaluated. Ideally the α’th percentile p-value for a hypothesis test would be α/100% if the null hypothesis is true. We propose that the lower and upper bounds for the percentile p-value corresponding to the critical p-value should be within a factor of two of the critical p-value. For example, if the critical p-value is 0.05, then the LB and UB for the 5th percentile p-value should be between 0.025 and 0.10.Footnote10 If not then the test p-value may be inaccurate (i.e. either too liberal or too conservative) and unreliable. This involves calculating the LB and UB values using EquationEquation (1). Indices and estimated coefficients for the Fisher-Freeman-Halton exact test for tables up to 3×3 and 2×4 are provided in for the 5th percentile p-value. Indices and estimated coefficients for the Pearson chi-square test and the ASE0 version of the Goodman-Kruskal gamma test for tables up to 4×4 and 2×5 are provided in for the 5th percentile p-value, provided that nˆmin>0.5.Footnote11 Example envelope model calculations are in Appendix E. R run scripts to evaluate the recommended criteria are also provided in the supplemental material (see Appendix G).

5.1. Fisher-Freeman-Halton exact test accuracy criteria

The p-value accuracy criterion for the Fisher-Freeman-Halton test with a critical p-value of 0.05 is satisfied if |Nrc|20.955. The |Nrc| value can be calculated by the algorithm in Appendix B.2.

5.2. Pearson Chi-Square test accuracy criteria

The p-value accuracy criteria for the Pearson chi-square test depend on both |Nrc| and minνnˆ. This criteria can be graphically evaluated by referring to . If the |Nrc| versus minνnˆ coordinate is in one or both of shaded regions in then the test p-value may be either too liberal or too conservative. If the coordinate is in the unshaded region of both subfigures then the recommended criteria for the Pearson chi-square test are satisfied.

The p-value accuracy criteria for the Pearson chi-square test can also be numerically evaluated using EquationEquation (1) and the model coefficients in . The LB and UB values should be within a factor of two of the ideal value (0.05). Otherwise the test could be too liberal if LB<0.025 or too conservative if UB>0.10.

An example calculation for the 5th percentile Pearson chi-square test p-value envelope is in Table E.2. This example indicates that the 5th percentile p-value for randomly sampled 2×3 tables with r=[39,11] and c=[35,9,6] is between 0.039402 and 0.083430 for 95% of the tables. This range is between 0.025 and 0.10, which is desirable from a test accuracy viewpoint.

5.3. Goodman–Kruskal Gamma test accuracy criteria

The p-value accuracy criteria for the Goodman-Kruskal gamma test depend on |Orc|, νminνnˆ, and n../RC. This criteria require the calculation of LB and UB to determine if the range is within a factor of two of the ideal value (e.g. 0.05). The LB and UB values are calculated using EquationEquation (1) and the model coefficients in . The |Orc| value can be calculated by the algorithm in Appendix B.3.

An example calculation for the 5th percentile Goodman-Kruskal gamma test p-value envelope is in Table E.3. This example indicates that the 5th percentile p-value for randomly sampled 2×3 tables with r=[39,11] and c=[35,9,6] is between 0.015660 and 0.057842 for 95% of the tables. This range is not within a factor of two of the ideal value (0.05), which is undesirable from a test accuracy viewpoint.

Therefore, these recommended criteria indicate that the example 2×3 table listed in is within the domain-of-validity of the Fisher-Freeman-Halton exact test and the Pearson chi-square test, but not within domain-of-validity of the Goodman-Kruskal gamma test. This could explain the apparently inconsistent hypothesis test results listed in .

6. Conclusions

The domain-of-validity and accuracy of the Fisher-Freeman-Halton exact, Pearson chi-square, and Goodman-Kruskal gamma tests for small R×C contingency tables were investigated. The Fisher-Freeman-Halton exact test and the Pearson chi-square test consider whether or not the frequencies in a contingency table have equal underlying proportions or independence. The Goodman-Kruskal gamma test considers whether or not a measure of association based on the frequencies in the table is statistically significant.

The investigation involved Monte Carlo simulations of randomly sampled contingency tables up to 4×4 and 2×5 in which the row and column variables were independent (i.e. equal underlying proportions). One hundred thousand tables were randomly sampled from each set of fixed row and column margin totals, for all margin total sets up to a pre-determined maximum table total (e.g. 200 for 2×2 tables). P-values for each test were calculated for each sampled table, and the 5th percentile p-values were determined for each set of row and column margin totals.

It was observed that there were some sets of row and column margin totals for which the minimum possible p-value for a given hypothesis test was greater than likely critical p-values (e.g. 0.05). If this condition occurs than the R×C table is outside the p-value domain-of-validity of the hypothesis test and the test should not be used.

Envelope models were then estimated using quantile regressions to describe the range of the 5th percentile p-values observed in the Monte Carlo simulations. Candidate indices related to the discreteness and approximation accuracy of the p-value distributions, the total table counts, and the minimum possible p-value were considered. The indices can be calculated from R, C, and the row and column margin totals using the equations and algorithms described herein. The 2.5% and 97.5% quantiles were used to describe the lower and upper bounds of the range in the 5th percentile p-values. Data outside the p-value domain-of-validity were not used for these regressions.

The results for the Fisher-Freeman-Halton exact test indicated that this test tends to be conservative (i.e. the 5th percentile p-value tends to be greater than 0.05), and that the p-value accuracy depends on the number of possible R×C tables with unique nominal probabilities (|Nrc|). The lower bounds is approximately equal to the ideal 5th percentile p-value provided that |Nrc|>1. The upper bounds is within a factor of two of the ideal 5th percentile p-value if |Nrc|20.955.

The results for the Pearson chi-square test indicated that this test tends to be unbiased (i.e. the median 5th percentile p-value tends to be about 0.05), and that the p-value accuracy depends on |Nrc| and the sum of the ν smallest expected cell counts (minνnˆ), where ν is the number of degrees-of-freedom for the chi-square statistic. Equations and model coefficients are provided in to estimate the range of 5th percentile chi-square test p-values based on R, C, and the row and column margin totals. The values for |Nrc| and minνnˆ for which the lower and upper bounds for the 5th percentile p-values are within 0.025 and 0.10 are graphically depicted in .

The results for the Goodman-Kruskal gamma test indicated that the p-value accuracy depends on the number of possible R×C tables with unique ordinal probabilities (|Orc|), νminνnˆ, and n../RC. Equations and model coefficients are provided in to estimate the range of 5th percentile gamma test p-values based on R, C, and the row and column margin totals.

We propose that the lower and upper bounds for the percentile p-value corresponding to the critical p-value should be within a factor of two of the critical p-value. Therefore, if the critical p-value is 0.05, then the LB and UB for the 5th percentile p-value should be between 0.025 and 0.10. The accuracy of the 5th percentile Fisher-Freeman-Halton p-value is with this range if |Nrc|20.955. Other levels of p-value accuracy can also be specified, such as between 0.04 and 0.06. The accuracy criteria for the Pearson chi-square test depends on |Nrc| and minνnˆ, and the Goodman-Kruskal gamma test depends on |Orc|, νminνnˆ, and n../RC.

An example application of the proposed criteria to a 2×3 contingency table was also given. These criteria indicated that the example 2×3 table is within the domain-of-validity of two of the three hypothesis tests investigated.

The criteria presented herein is specific to the 0.05 critical p-value. Criteria specific to other critical p-values (e.g. 0.01) or other hypothesis tests could also be determined using the methods described herein.

Supplemental material

Supplemental Material

Download PDF (19.9 MB)

Disclosure statement

No potential conflict of interest was reported by the authors.

Supplementary Material

Supplemental data for this article can be accessed here.

Additional information

Funding

This work was supported by American Honda Motor Co., Inc. [NA];

Notes on contributors

R. M. Van Auken

R. M. Van Auken and S. A. Kebschull have been involved in various multi-disciplinary applied research projects primarily for the ground vehicle industry. Research activities involved modeling, simulation, testing, and analysis of human-vehicle-environmental systems; including vehicle dynamics, manual control, biomechanics, accident data analysis, and risk/benefit analysis. They have also been involved in regulation, testing and standards development. Some of these activities involved analysis of contingency tables using hypothesis tests such as the Fisher exact, Pearson chi-square, and Goodman-Kruskal gamma tests. The results of the research reported herein provided useful insight and understanding for suitability and interpretation of these hypothesis test results.

Notes

1. The results in this paper are specific to a critical p-value of 0.05 because this value is widely used. See [Box, Hunter, and Hunter, 1978, p. 109] and (Wasserstein & Lazar, Citation2016). Other critical p-values such as 0.01 have also been used. The methods used herein could be extended to these other critical p-values in the future.

2. Norušis states `[i]f you reject the null hypothesis that two variables are independent, you may want to describe the nature and strength of the relationship between the two variables’ [Norušis, 2008, p. 172]. This suggests that association is only meaningful if the null hypothesis for independence can be rejected.

3. In our notation k=RC and n=n...

4. See footnote 1.

5. This condition can occur in 2×2 tables even for large n.. if one of the row margin totals is equal to 1 and the column margin totals are equal. For example, if r=[1,199] and c=[100,100].

6. Many of the 14,154,495 symbols in Figure 2 mask other symbols due to the symbol size and drawing order. The symbols were drawn in random order in order to depict the relative distribution of the table dimensions. The absolute distribution of the observed p-values is not depicted because of masking.

7. The criteria for a desirable range of p0.05 values is arbitrary but directly relates to test p-value accuracy. Cochran [Cochran, 1952, pp. 328–329] suggested a range between 0.04 and 0.06 for a critical p-value of 0.05. Cochran acknowledged that these `limits are, of course, arbitrary; some would be content with less conservative limits.’ We are using a range of 0.025 to 0.10, which corresponds to a relative factor of 2, and is less conservative then Cochran’s suggested limits.

8. If we use the estimated alternative model coefficients in Table D.2 then the 95% interval for p0.05 the Fisher-Freeman-Halton test would exceed 0.10 if |Nrc|<21.738 and minpFisher=0.

9. The envelope model search results in Tables C.5 and C.6 indicate that models based on |Src| tend to fit the distribution of 5th percentile Goodman-Kruskal gamma test p-values almost as well as |Orc|. The |Src| index requires less effort to calculate than |Orc|, which is desirable. Estimated coefficients for an alternative Goodman-Kruskal gamma test envelope model based on |Src| are listed in Table D.3 in Appendix D.5.

10. Cochran [Cochran, 1952, pp.328–329] suggested a range between 0.04 and 0.06 for a critical p-value of 0.05, and a range between 0.007 and 0.015 for a critical p-value of 0.01. Cochran acknowledged that these `limits are, of course, arbitrary; some would be content with less conservative limits.’ Our proposed limits are also arbitrary and less conservative than Cochran’s limits.

11. We can reasonably assume that the Pearson chi-square and Goodman-Kruskal gamma tests are not valid if nˆmin0.5 based on traditional criteria and extrapolation of the envelope models.

References