24
Views
0
CrossRef citations to date
0
Altmetric
Research Article

Generating contingency tables with fixed marginal probabilities and dependence structures described by loglinear models

, &
Received 20 Sep 2023, Accepted 05 May 2024, Published online: 15 Jun 2024

Abstract

We present a method to generate contingency tables that follow loglinear models with prescribed marginal probabilities and dependence structures. We make use of (loglinear) Poisson regression, where the dependence structures, described using odds ratios, are implemented using an offset term. Other statistical models related to loglinear models that fall into the scope of this paper, such as the logistic regression model, the latent class model and the extended Rasch are discussed as well. We apply this methodology to carry out simulation studies in the context of population size estimation using dual system and triple system estimators, popular in official statistics. These estimators use contingency tables that summarize the counts of elements enumerated or captured within lists that are linked. The simulation is used to investigate these estimators in the situation that the model assumptions are fulfilled, and the situation that the model assumptions are violated.

1. Introduction

In simulation studies that make use of contingency tables, samples are drawn from a population with properties that are well understood. In this paper a method is proposed for the generation of contingency tables with fixed marginal probabilities and prescribed dependence structures that are defined in terms of odds ratios. An odds ratio describes the strength of the association for a two-way contingency table. For multi-way tables conditional odds ratios play the same role for partial associations between variables [Citation1]. A simple example of such a population would be a two-way contingency table with marginal probabilities of 0.7 and 0.8 and the dependence represented by an odds ratio of 2.

Loglinear models have interaction parameters that can be directly understood in terms of odds ratios. Therefore the proposed method is built around the use of loglinear models. Models with two and three variables are explored, where each variable is dichotomous. However, the methodology presented can generate populations using loglinear models of any size, in terms of both the number of variables and the number of levels of the variables. Also, we can generate populations following logistic regression models having only categorical explanatory variables, and latent variable models that can be formulated as loglinear models, such as the latent class model [Citation2] and the extended Rasch model [Citation3].

There are a few ways to fit loglinear models. It suits our purposes to define loglinear models in terms of Poisson regression models with a log link [Citation4]. Such models can be fitted with, for example, iteratively reweighted least squares (IRLS) [Citation5]. The models we use require the marginal probabilities and odds ratios as input parameters. As a first step, the given marginal probabilities are used to generate the probabilities in a loglinear model for which complete independence holds [Citation6,Citation7, p. 345]. The prespecified interaction structure is imposed by using an offset, which is an additional explanatory variable with a prespecified parameter value that is fixed and not estimated.

When each margin of a contingency table has only two categories, the contingency table is a summary of the observations of correlated binary variates. Previously, Lee [Citation8] proposed a linear programming method for this problem. In comparison to his method, our method is simpler. Also, we present our method in the context of various applications of the loglinear model where conditional odds ratios can also be specified, and of latent variable models. The work of Gange [Citation9] is related to our work in the sense that he uses the iterative proportional fitting algorithm, an algorithm that was very popular in the early ages of loglinear modelling; yet he does not make a link to loglinear models and mentions odds rations only in passing. Choi [Citation10] also presented a method based on loglinear models, where the association is formulated in terms of the Pearson chi-square coefficient, and iterative proportional fitting is used in the generation of probabilities. Thus the way in which the association is specified by Choi is different from our method, that is more closely related to the way in which the loglinear model is defined. Similar to our work, Kateri [Citation11, section 2.2.5, p. 43] described that local odds ratios and row and column marginal distributions uniquely determine the table of joint probabilities, which also holds for any other minimal set of odds ratios. Our work further develops this and demonstrates a simple approach to generating data from this information. Geenens [Citation12] sets out a similar approach into a framework where the dependence is measured using copulas. He explains that the bivariate distributions are uniquely determined by a ‘margin-free’ association parameter (and that multiple different kinds of parameter are possible, but with one-to-one relations between them.

There is also much related but different work on generating correlated binary variates, see, for example, Emrich and Piedmonte [Citation13], Lee [Citation8], Gange [Citation9], Park et al. [Citation14], Lee [Citation15], Lunn and Davies [Citation16], Al Osh and Lee [Citation17], Demirtas [Citation18] and Jiang et al. [Citation19]. There is also some (though rather less) work on generating correlated multinomial or ordinal variables which can be used when the margins have more than two categories, for example Gange [Citation9], Choi [Citation10], Amatya and Demirtas [Citation20] and Touloumis [Citation21]. These approaches mostly focus on generating population probabilities in order to assess the performance of generalized estimating equation (GEE) models. In passing, some of these papers mention that their methodology can also be used with odds ratios instead of correlations, but they do not provide a detailed description of how this works. Geenens [Citation12] extends the connection of copulas and margin-free association measures to general discrete distributions. Combined with marginal information these generate distributions and Geenens demonstrates their existence and uniqueness. Our work is different in that we focus on generating data from loglinear models where the dependence is specified using odds ratios rather than correlations, and describe how this leads to a simple implementation using standard functions for fitting loglinear models.

The methodology we propose is illustrated with an application to population size estimation, as has been used in population censuses. This follows on from the work of Brown et al. [Citation22], Baffour et al. [Citation23] and Gerritse et al. [Citation24] who used a fixed odds ratio to investigate the impacts of dependence on population size estimation. In this application, lists where individuals within the target population are enumerated are used to estimate the population sizes. These lists are subject to incompleteness (undercoverage error), and multiple system estimation has been proposed to adjust for this undercoverage error. Some assumptions about the dependence of the lists must be made (though these may be quite weak where there are many lists), so it is important to test these methods under prespecified conditions. The proposed methodology and its application allow investigations into the impacts of given dependence structures within contingency tables to be carried out, varying the sizes of odds ratios, population sizes and response probabilities, and under different models.

The structure of the paper is as follows. In Section 2 we outline the methodology around log linear models and model fitting. From this we develop the proposed method for including prespecified interaction parameters in Section 2.4. In Section 3 we present an application where the proposed method is used to investigate the properties of population size estimation under dependence. A simulation study is presented for two-way and three-way contingency tables, alongside steps for users to implement the methodology. Section 4 makes an assessment of the advantages of our approach compared with existing methods for correlated binary or categorical data.

2. Methodology

2.1. Loglinear models

2.1.1. Models for two-way tables

The simplest situation that we consider is the loglinear model with two dichotomous variables A and B. Let A be indexed by i = 0, 1 and B by j = 0, 1. We denote the expected count for cell (i,j) of the two-way contingency table of variables A and B by μij. We denote the loglinear parameters by λ. Then the saturated loglinear model for two lists is: (1) logμij=λ+λiA+λjB+λijAB,(1) where λ is the intercept term, λiA and λjB are the respective parameters for variables A and B, and λijAB is the two factor interaction parameter. The parameters are identified by setting the parameter equal to 0 when an index is 0, i.e. λ0A=λ0B=λ00AB=λ01AB=λ10AB=0. Thus the model has four parameters and four observed counts, and hence is called saturated. The loglinear interaction parameter λ11AB is closely related to odds ratio θAB and the expected counts by (2) θAB=exp(λ11AB)=(μ11/μ01)(μ10/μ00)=μ11μ00μ01μ10.(2) When λijAB=0, the odds ratio θAB=1, and the independence model holds.

The proposed methodology can be extended in a straightforward way to variables with more than two levels (compare [Citation11], section 2). Assume a 3×3 table with levels i, j = 1, 2, 3, with level 3 being the reference category. The model is described in Equation (Equation1) with i, j now having different levels than in the 2×2 context. If we identify the 9 interaction parameters by setting these parameters equal to 0 when (i,j=3), then there are four interaction parameters to be estimated, namely λ11AB,λ12AB,λ21AB and λ22AB. Thus there are four odds ratios, θ11AB,θ12AB,θ21AB and θ22AB, where for each cells in level (i,j=3) is involved. For example, the odds ratio (3) θ12AB=exp(λ12AB)=(μ12/μ13)(μ32/μ33)=μ12μ33μ13μ32.(3)

2.1.2. Models for three-way tables

Consider the loglinear model for three dichotomous variables, where the third variable is denoted as C. Here the saturated model is (4) logμijk=λ+λiA+λjB+λkC+λijAB+λikAC+λjkBC+λijkABC,(4) where λ is the intercept term, λiA λjB and λjC are respectively the parameters for variables A, B and C, λijAB, λikAC and λjkBC are the two factor interaction parameters and λijkABC is the three factor interaction parameter. In order to identify the parameters, similar identifying restrictions hold as for the two-variable case.

When three dichotomous variables are present, the odds ratio for each pairwise relationship is conditional on the outcome of the variable not included in the pairwise relationship. E.g. when there is a relationship between variables A and B, the odds ratio is conditional on variable C: (5) θAB|C=0=exp(λ11AB)=μ110μ000μ100μ010,(5) (6) θAB|C=1=exp(λ11AB+λ111ABC)=μ111μ001μ101μ011.(6)

2.2. Latent variable models

Loglinear models are also used in the context of latent variable models. The Rasch model, popular in the field of item response theory in psychometrics and in the field of population size estimation, assumes a single continuous latent variable that explains the dependence between observed dichotomous variables, i.e. the observed variables are conditionally independent given the latent variable. A version of the popular Rasch model [Citation25, originally published in 1960] is the so-called extended Rasch model [Citation3], that can be written for three observed variables as model (Equation4) with all two-factor interactions identical, i.e. λ11AB=λ11AC=λ11BC. In psychometric testing the variables A, B and C stand for item responses, and in the Rasch model individuals are ordered by the difficulty of the items and items are ordered by the aptitude of the individuals. For example, an individual may master item A but neither B nor the most difficult item C. Another individual masters A and B and thus has a higher aptitude than the first individual. Thus the model allows individuals to have heterogeneous probabilities to master items.

The latent class model assumes the existence of a categorical latent variable X with levels x, and given the level x the observed variables are independent. Let's assume that there are four observed variables A, B, C and D (having levels indexed by l) and a dichotomous variable X. Written in terms of (conditional) probabilities we have, (7) πijklABCD=ΣxπijklxABCDX=ΣxπxXπi|xAXπj|xBXπj|xCXπl|xDX.(7) This model was originally proposed by Lazersfeld and Henry [Citation26]. Haberman [Citation2] showed that the model can also be written as a loglinear model: (8) logμijklxABCDX=λ+λiA+λjB+λkC+λlD+λixAX+λjxBX+λkxCX+λlxDX,(8) with μijklABCD=ΣxμijklxABCDX. For the purpose of this paper, writing the latent class model as in Equation (Equation8) may not be that advantageous to generate a population, as Equation (Equation7) can be used for this purpose in a straightforward way by specifying the (conditional) probabilities on the r.h.s. of Equation (Equation7), thus generating population probabilities for the four-way table.

2.3. Logistic regression

Due to the relation between the loglinear model and the logistic regression model with categorical explanatory variables [Citation6, section 2.2.4], the method proposed in this paper can also be used for this subset of logistic regression models. We show the relation between the models for a simple example. Consider model (Equation4). Let the variable C with levels k = 0, 1 be the response variable and variables A and B be categorical variables. Then the logistic regression model is (9) logμij1μij0=logμij1logμij0=λkC+λikAC+λjkBC+λijkABC,(9) where the λ-parameters λkC,λikAC,λjkBC and λijkABC can be replaced by β-parameters (not involving k and C) β,βiA,βjB and βijAB respectively. The loglinear model [AB][C] corresponds with the intercept only model in logistic regression as λikAC=λjkBC=λijkABC=0 implies βiA=βjB=βijAB = 0.

Multinomial and ordinal response models can also be dealt with using this approach if they have equivalent loglinear models. Examples include the baseline category-logit model and the continuation-ratio logit model, see Agresti [Citation1].

2.4. The proposed method

Loglinear models are usually estimated using the maximum likelihood method. An important property of maximum likelihood estimates of loglinear models is that, for the parameters fitted, the corresponding margins of the observed counts equal the margins of the fitted counts [Citation6, p. 69]. These fitted counts are the sufficient statistics. We denote maximum likelihood estimates by adding a hat to parameters and expected counts.

We focus here on the estimation of hierarchical loglinear models, starting with the two-way table and extending to three-way tables later on. In hierarchical loglinear models, when higher-order parameters are included, the lower order parameters are included as well, for example, if λijAB is included, the lower order parameters λiA and λjB are included too. The use of hierarchical loglinear models allows for simple and easier interpretation of the model parameters.

2.4.1. Model fitting for two-way tables

For a two-way contingency table with observed counts nij, when we fit model (Equation1), the fitted values are equal to the observed counts, i.e. μ^ij=nij. This model is denoted by the variables constituting the highest fitted margins, i.e. [AB]. The margins nij are the sufficient statistic for the saturated model. For the independence model, denoted by [A][B], the margins of variable A and variable B are fitted, i.e. μ^ij=μ^i+μ^+j/μ^++=ni+n+j/n++, where a ‘+’ denotes that the sum is taken over the corresponding index.

Although the two most relevant models, the saturated model and the independence model, can be fitted in a straightforward way, we fit the model here through loglinear Poisson regression, as this suits our purposes later on. Assume that the four elements nij are counts derived from four Poisson distributions. Then a likelihood can be set up for μij and maximized over the parameters. If we collect the elements μij into a column vector, then model (Equation1) can be written as, (10) [log(μ11)log(μ10)log(μ01)log(μ00)]=[1111110010101000][λλ1Aλ1Bλ11AB],(10) where the matrix with 0's and 1's is the design matrix formulating the loglinear model. Maximizing the likelihood gives maximum likelihood estimates μ^ij and parameter estimates λ^,λ^1A,λ^1B and λ^11AB. In fitting the independence model, the last column of the design matrix, (1,0,0,0)T and the parameter λ11AB are left out of the equation. In software packages this can be accomplished in routines for Poisson regression, where the four observed counts nij are provided as the Y-variable, the model is represented by the design matrix in Equation (Equation10), and the output consists of four fitted values μ^ij and the parameter estimates.

2.4.2. Generating two-way tables with given properties

We now develop a model for the two-way contingency table with prespecified marginal probabilities and a prespecified interaction parameter. For the interaction parameter we want to make use of the odds ratio θAB defined in Equation (Equation2). This can be accomplished by using an offset in Poisson regression. An offset is a column in the design matrix for which no parameter is estimated, i.e. the parameter is fixed to 1. The odds ratio is specified to give the desired level of dependence between lists. If we denote the prespecified odds ratio by θ~AB, then the desired loglinear interaction parameter is λ~11AB=logθ~AB. Thus the population that we want to generate has to follow the following model: (11) logμij=λ+λiA+λjB+λ~ijAB.(11) The following equation, an adjusted version of Equation (Equation10), illustrates how this is implemented in Poisson regression: (12) [log(μ11)log(μ10)log(μ01)log(μ00)]=[111logθ~AB110010101000][λλ1Aλ1B1].(12) Notice that the parameter λ11AB from (Equation10) is set to 1, i.e. it is not estimated, and this value 1 is multiplied with the prespecified logθ~AB in the design matrix. In software packages for Poisson regression this can be accomplished as follows:

  • specify an offset vector which is logθ~AB when (i,j)=(1,1) and 0 otherwise.

  • use an independence model as the design matrix, as we only need estimates for λ,λiA and λjB.

  • in order to fit a model that leads to estimated counts that follow the prespecified marginal probabilities, calculate joint probabilities for which these marginal probabilities hold, and use these joint probabilities, multiplied with some constant, as the counts nij to be analyzed with model (Equation11).

  • then the estimates of expected frequencies can be used to derive the joint probabilities with the prespecified marginal probabilities and the prespecified odds ratio.

As an example, assume we want to generate population probabilities that have the prespecified marginal probabilities 0.7 and 0.8 and the prespecified odds ratio 2. Then the probabilities to be analyzed are 0.7 * 0.8 = 0.56, 0.7 * 0.2 = 0.14, 0.3 * 0.8 = 0.24 and 0.3 * 0.2 = 0.06, and multiply these probabilities with some number, such as 1000, to get counts. Thus 560, 140, 240 and 60 are the values of the dependent variable in Poisson regression. Specify log(2) as the offset and analyze the counts with the loglinear independence model and the offset. This yields the fitted values 584.8, 115.2, 215.2, 84.8 (rounded to 1 decimal place), and by dividing them by 1000 we have the four probabilities, with the prespecified marginal probabilities 0.7 and 0.8 and the prespecified odds ratio 2. The R-code we used can be found in the online Supplementary Material.

We now prove that for model (Equation11) the prespecified margins of the observed table are equal to those in the fitted table, irrespective of the offset. We follow Fienberg [Citation27, Appendix II]. If we assume that the counts are observations from independent Poisson distributions, then the likelihood function is proportional to (13) ijμijnijexp(μij),(13) and the kernel of the loglikelihood is, (14) ijnijlogμijijμij=ijnij(λ+λiA+λjB+λ~ijAB)ijμij.(14) At the maximum of the likelihood the derivative of the loglikelihood over the parameters is 0. So, for example, taking the derivative over λiA gives ni+μ^i+=0 and this gives ni+=μ^i+. Similarly for n+j=μ^+j. This completes the proof. It follows that including an offset for the interaction does not affect the property of ordinary loglinear models that for the fitted parameters the corresponding margins of the fitted values are equal to those of the observed values.

The model for the 3×3 table in Section 2.1.1 has the following implementation: (15) [log(μ11)log(μ12)log(μ13)log(μ21)log(μ22)log(μ23)log(μ31)log(μ32)log(μ33)]=[11010logθ~11AB11001logθ~12AB11000010110logθ~21AB10101logθ~22AB101000100100100010100010][λλ1Aλ2Aλ1Bλ2B1].(15)

2.4.3. Model fitting for three-way tables

For the saturated model with three variables, the fitted values are equal to the observed counts so μ^ijk=nijk. As an example of a restricted model, the estimates for the [AB][AC] model can be calculated directly as (16) μ^ijk=μ^ij+μ^i+kμ^i++=nij+ni+kni++.(16) However, for the loglinear model [AB][AC][BC] estimates must be calculated indirectly, iteratively (for example, [Citation6], p. 84).

We now develop (Equation4) in the form of Poisson regression. There are eight elements nijk, being counts from eight Poisson distributions. The likelihood can be set up for μijk and maximised over the parameters. Model (Equation4) can be denoted as, (17) [log(μ111)log(μ110)log(μ101)log(μ100)log(μ011)log(μ010)log(μ001)log(μ000)]=[1111111111101000110101001100000010110010101000001001000010000000][λλ1Aλ1Bλ1Cλ11ABλ11ACλ11BCλ111ABC].(17) As before, for (Equation17) the matrix with 0's and 1's is the design matrix formulating the loglinear model. A restricted model can be fitted by dropping columns in Equation (Equation17). For example, in fitting the model where B and C are independent given A, the last two columns of the design matrix and the last two parameters are left out of the equation.

2.4.4. Generating three-way tables with given properties

We now focus on loglinear models with prescribed odds ratios. We start with the model with no three-factor interaction. The model for a three-way contingency table with three prespecified odds ratios θ~AB,θ~AC and θ~BC is denoted as, (18) logμij=λ+λiA+λjB+λkC+logθ~AB+logθ~AC+logθ~BC.(18) In the adjusted version of (Equation17) this becomes (using that logθ~AB+logθ~AC+logθ~BC=logθ~ABθ~ACθ~BC): (19) [log(μ111)log(μ110)log(μ101)log(μ100)log(μ011)log(μ010)log(μ001)log(μ000)]=[1111logθ~ABθ~ACθ~BC1110logθ~AB1101logθ~AC110001011logθ~BC101001001010000][λλ1Aλ1Bλ1C1],(19) where the parameter for the offsets has been set to 1. In software packages for Poisson regression this can be accomplished as follows:

  • specify logθ~AB, logθ~AC and logθ~BC, and use them to generate the offset column of the design matrix as in (Equation19).

  • use an independence model in the design matrix, as we only need estimates for λ,λiA, λjB and λkC.

  • in order to fit a model that leads to estimated counts that follow prespecified marginal probabilities, calculate joint probabilities for which these marginal probabilities hold by multiplying the relevant probabilities. Use these joint probabilities, multiplied by some constant representing a notional population size, as the observed counts corresponding with nijk to be analyzed with model (Equation19).

  • then, as before in Section 2.4.1, the estimates of the expected frequencies can be used to derive the joint probabilities with the prespecified marginal probabilities and the prespecified odds ratios.

As an example, assume we want to generate population probabilities that have the prespecified marginal probabilities 0.8, 0.7 and 0.9 for variables A, B and C respectively, and with all the prespecified odds ratios =2. The probabilities following the independence model are 0.7 * 0.8 * 0.9 = 0.504, 0.7 * 0.8 * 0.1 = 0.056, 0.7 * 0.2 * 0.9 = 0.216, 0.7 * 0.2 * 0.1 = 0.024, 0.3 * 0.8 * 0.9 = 0.126, 0.3 * 0.8 * 0.1 = 0.014, 0.3 * 0.2 * 0.9 = 0.054 and 0.3 * 0.2 * 0.1 = 0.006 and we multiply these probabilities by 1,000 to get counts 504, 56, 216, 24, 126, 14, 54 and 6 which form the values of the dependent variable in Poisson regression. Specify log(2) as the three offsets and analyze the counts with the loglinear independence model and the offset as in Equation (Equation19). This yields the fitted values (rounded to 1 decimal place) 547.3, 39.5, 186.3, 26.9, 99.0, 14.3, 67.4 and 19.4. By dividing these by 1000 we have the eight probabilities with the prespecified marginal probabilities 0.7, 0.8 and 0.9 and the three prespecified odds ratios 2. For example, the marginal probability of being in variable A equals the sum of the fitted values for A = 1 divided by 1000, which is 0.8 and the conditional odds θBC|A=1=(547.326.9)/(39.5186.3)=2. The R-code we used can be found in the online Supplementary Material.

We now discuss the model with a prespecified three-factor interaction. Above we discussed model fitting for three-way tables, with no three-factor interaction. Here we develop the model to include a prespecified three-factor interaction parameter in a loglinear model. The three-factor interaction is defined as the state where two conditional odds ratios are not equal, for example, θ~AB|C=0θ~AB|C=1. These conditional odds ratios are related to loglinear parameters by θ~AB|C=0=expλ~11AB and θ~AB|C=1=exp[λ~11AB+λ~111ABC]. As similar results hold for θ~AC|B=0, θ~AC|B=1, θ~BC|A=0 and θ~BC|A=1 this shows that the three-way interaction is represented by expλ~111ABC. A three-way interaction will be handled as follows. In Equation (Equation19) we fill in logθ~AB|C=0 in cell (2,5) of the design matrix and logθ~AB|C=1 in cell (1,5) of the design matrix, and similarly for the other conditional odds ratios. Thus, the model with prespecified conditional odds ratios can be presented as, (20) [log(μ111)log(μ110)log(μ101)log(μ100)log(μ011)log(μ010)log(μ001)log(μ000)]=[1111logθ~AB|C=1θ~AC|B=1θ~BC|A=11110logθ~AB|C=01101logθ~AC|B=0110001011logθ~BC|A=0101001001010000][λλ1Aλ1Bλ1C1].(20) We can trivially write (21) θ~AB|C=1=θ~AB|C=0θ~AB|C=1θ~AB|C=0,(21) the term [θ~AB|C=1/θ~AB|C=0], which is the difference in odds for C = 1 and C = 0, is equal to expλ~111ABC. The terms θ~AC|B=1 and θ~BC|A=1 include this same three-way interaction term, i.e. (22) θ~AC|B=1=θ~AC|B=0θ~AB|C=1θ~AB|C=0,(22) (23) θ~BC|A=1=θ~BC|A=0θ~AB|C=1θ~AB|C=0.(23) In fact since the three-factor interaction parameter λ~111ABC is the difference between each pair of conditional odds ratios, it is clear that (24) θAB|C=1θAB|C=0=θAC|B=1θAC|B=0=θBC|A=1θBC|A=0.(24) If we specify the denominators of each fraction, then as soon as one numerator is specified then all the fractions are determined. There are therefore only four free odds ratios in Equation (Equation24), and the three-factor interaction can be specified by providing all the denominators and any one of the numerators. Therefore we can rewrite element (1,5) of the design matrix in (Equation20) as (25) logθ~AB|C=1θ~AC|B=1θ~BC|A=1=logθ~AB|C=0θ~AC|B=0θ~BC|A=0[θ~AB|C=1θ~AB|C=0]3,(25) using only four odds ratios.

As an example, assume that we want to create population probabilities with conditional odds ratios θ~AB|C=0=2 and θ~AB|C=1=3. It follows that the three factor term is [θ~AB|C=1/θ~AB|C=0]=3/2. Then, let θ~AC|B=1=1, and it follows that θ~AC|B=0[θ~AB|C=1/θ~AB|C=0]=1, so that θ~AC|B=0=2/3. Similarly, if θ~BC|A=1=1, then θ~BC|A=0=2/3. Thus the model equation becomes (26) [log(μ111)log(μ110)log(μ101)log(μ100)log(μ011)log(μ010)log(μ001)log(μ000)]=[1111log31110log21101log(2/3)110001011log(2/3)101001001010000][λλ1Aλ1Bλ1C1].(26)

3. Application to population size estimation

3.1. Dual system estimation

Nowadays dual system estimation methods are commonly used to estimate population totals for domains of interest. The dual system estimator is the simplest capture-recapture model and is used to estimate the part of the population that is missing after having linked individuals in two overlapping lists. Originally this method was implemented to estimate the size and density of wildlife populations, but it has also been developed and used in official statistics, where the dual-system estimator was originally used for the 1950 decennial United States Census to estimate coverage error [Citation28].

The linked data can be presented in a two-way contingency table: we observe the counts of those in both lists, those in list A only, and those in list B only, but not those missing from both lists. Thus there are three observed cell counts, and the missing count for those in neither of the lists is to be estimated, which can conveniently be done using the loglinear model framework [Citation29].

Dual system estimation relies on five key assumptions, outlined by the International Working Group for Disease Monitoring and Forecasting [Citation30]:

  1. The population is closed, i.e. there is no change such as births or deaths.

  2. There is perfect matching between the two lists.

  3. The capture/inclusion probability of elements in at least one of the lists is homogeneous [Citation31,Citation32].

  4. The probabilities of inclusion in the lists are independent.

  5. There are no erroneous captures in any list (no overcoverage).

One of the key assumptions is that the probabilities of inclusion in the lists are independent, which is easily violated [Citation24]. However, with the data used for population size estimation, this assumption is impossible to verify and so the design of both lists is very important, to create the conditions where this assumption is likely to hold. A way to reduce the impact of violating this strict assumption is to divide the population by certain characteristics, such as age, sex and some geographical breakdown, so that we need only assume that the lists are independent conditional on these classifiers.

Assumption 4 follows on from assumption 3, which implies that individuals who where included in the first list and those who were not have the same probability of being included in the second list, so that presence in the first list does not affect response in the second list, and the lists are therefore statistically independent [Citation30]. Therefore, as there are three observed counts, to estimate cell counts μ^ij for a two-way contingency table only three parameters can be used and no interaction term can be modelled, and (Equation1) is adjusted to include no interaction term, i.e. λijAB=0. Independence implies that the odds ratio (Equation2) is 1, so that the missing count is estimated by μ^00=μ^10μ^01/μ^11=n10n01/n11.

3.2. Triple system estimation

The triple system estimator uses three linked lists. Counts can be summarized in a three-way contingency table where one count is missing, namely the count for the individuals that are missed by all three lists. By estimating this count and adding it to the seven observed counts, the population size is estimated. Unlike a two-way contingency table, in a three-way contingency table the pairwise interactions between lists can be modelled. This means that the homogeneous capture/inclusion probabilities and the independence assumption can be relaxed, while the other assumptions outlined for dual system estimation still hold.

As there are seven cells with observed counts, only loglinear models with up to seven independent parameters can be fitted. This is usually solved by assuming the three-factor interaction term λ^ijkABC=0. Including two-factor interaction terms in the model enables us to estimate population totals where pairwise relationships are present [Citation6]. For this paper, we are interested in fitting the maximal model for a three-way contingency table with one unobserved cell, which is an adjusted version of (Equation4) with no three factor interaction term.

3.3. How to apply the proposed methodology to population size estimation

The following outlines how to generate contingency tables with prespecified margins and interactions between lists. Such generated contingency tables can be used in simulations to study the behaviour of estimators for the population size, as the true population size is known.

To generate a contingency table where lists are independent, the total population size and inclusion probabilities for each list are prespecified. The cell counts are then computed by multiplying the marginal inclusion probabilities and the population size. This results in cell counts that follow statistical independence between lists.

To generate a table with a given dependence structure, an independent loglinear model is then fitted to these cell counts with the inclusion of an offset term, as presented in (Equation11) and (Equation18). Then the fitted counts have the following properties:

  1. Their sum is equal to the prespecified population size,

  2. The prespecified odds ratios hold, and

  3. The prespecified marginal response probabilities hold.

In order to investigate the behaviour of population size estimators, multinomial samples are drawn from this population. For the study of the dual system estimator, in each sample the cell (0,0) is set to missing and the estimator is applied. For the study of the triple systems estimator, in each sample the cell (0,0,0) is set to missing and the estimator is applied. When t samples are drawn, t population size estimates are obtained that can be compared with the prespecified population size.

3.4. Simulation study

The aim of the simulation study is to illustrate the methodology. Probabilities are generated using prespecified marginal probabilities and odds ratios. 2000 multinomial samples are drawn with these probabilities and with the total population size set equal to 1000 or 2000. The prespecified probabilities and odds ratios mirror those used by Brown et al. [Citation22] in their investigation of methods to deal with dependence in the coverage estimation for the 2001 population census of England and Wales. The outcomes of the simulations are presented in Tables  and . The results include the specified values for the total population, the marginal response probabilities, the odds ratio (OR) between the lists, the mean and median of the population size estimates, empirical 95% confidence intervals (which is the range between the 2.5% and 97.5% quantiles of the estimates), the relative bias %, (μ^N)/N and the Coefficient of Variation σ^/N, where N is the prespecified population size, μ^ is the mean and σ^ is the standard deviation of the population size estimates over the simulations.

Table 1. Simulation results for dual systems estimation with inclusion probabilities πA and πB, sizes of odds ratios (OR), the mean and median of the estimates, the 95% confidence interval of the mean, the relative bias and coefficient of variation (CV), where the fitted model is [A][B].

Table 2. Simulation results for triple systems estimation with inclusion probabilities πA = 0.8, πB = 0.7 and πC = 0.9, with different sizes of odds ratios (OR), specified models, the mean and median of the estimates, the 95% confidence interval of the mean, the relative bias and coefficient of variation (CV).

3.4.1. Two-way table

We set the response probabilities to either, for list A (πA) 0.8 and for list B (πB) 0.7; or for list A (πA) 0.9 and for list B (πB) 0.8. The joint probabilities (π11,π10,π01,π00) are, respectively, (0.56, 0.24, 0.14, 0.06) and (0.72, 0.18, 0.08 and 0.02). Cell counts are obtained by μij=(Nπij).

The results presented in Table  from these simulations show, that when the lists are independent (OR = 1), and the independence model is fitted, the mean of the estimated population size across 2000 multinomial samples is very close to the true population size. However, the bias appears to depend on the size of the population and the response probabilities for each list. When the total population is smaller, N = 1000, and the response probabilities are lower, πA = 0.7 and πB = 0.8, for the lists, the mean of the estimated population sizes is slightly larger than the true population size. However, the bias is negligible when N = 1000 and πA = 0.8 and πB = 0.9, as the mean estimated population total across all multinomial samples is approximately equal to the true population total. Therefore, the higher the response probabilities of the lists, the less biased the estimates of the population size will be.

When the lists are dependent, and the independence model is fitted, the mean estimated population size is smaller than the true population size. This shows that positive odds ratios result in population size estimates being negatively biased (compare with [Citation24]). In some cases, when the odds ratio = 1.1, the true population size falls within the confidence interval. This is an issue of power, as the misspecification of the model is small, and the number of simulations is apparently not large enough to detect it.

3.4.2. Three-way table

In our simulation, the three lists have response probabilities 0.8 for list A (πA), 0.7 for list B (πB) and 0.9 for list C (πC). The joint probabilities (π111,π110,π101,π100, π011,π010,π001,π000) are, respectively, (0.504, 0.056, 0.216, 0.024, 0.126, 0.054, 0.014, 0.006). Cell counts are obtained by μijk=(Nπijk).

The results of the simulation are shown in Table . The model used to generate the population data is specified by the odds ratios in the column OR, and the model used to analyze the data in the column Model. Thus there are correctly specified models, such as OR = (1,1,1) and Model is [A][B][C], overparameterised models, such as OR = (1,1,1) and Model is [AB][AC][BC], and misspecified models, such as OR = (1.5, 2, 1) and Model is [A][BC]. When the model is specified correctly the relative bias is close to zero. The variance of the estimates, decreases as the population size increases, compare, for example, OR = (1,1,1) and Model [A][B][C] with population sizes 1000 and 2000. When the model is overparameterised, the variance is larger than when the model is correctly specified.

4. Discussion

We proposed a simple method to generate contingency tables with fixed marginal probabilities and dependence structures described by loglinear models. We discuss the generation of two-way and three-way contingency tables and this can be extended to in a straightforward way.

We specify the loglinear model in terms of Poisson regression with a log link. Using an offset, prespecified interaction parameters can be fitted that can be understood in terms of odds ratios. For two-way tables, odds ratios describe dependence between two variables, and conditional odds ratios describe partial dependence between three or more variables in higher-way tables.

We think that this method is simpler than previously proposed methods which are discussed in the introduction, and it can be extended to other models that can be presented as loglinear models, such as logistic regression models with categorical predictors and some latent variable models. The structure of our approach to generating data is the same as the standard approach for undertaking multiple system estimation, so it is straightforward to understand how the dependence is introduced. Our approach is less tailored to the generalized estimating equations which have been the main application of other approaches.

In this paper we apply this method to population size estimation for two-way and three-way contingency tables. The results show the impacts of dependence between lists and model (mis)specification.

Supplemental material

Supplemental Material

Download PDF (43.5 KB)

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Agresti A. Categorical data analysis. New Jersey: Wiley; 2013.
  • Haberman SJ. A stabilized Newton–Raphson algorithm for log-linear models for frequency tables derived by indirect observation. Sociol Methodol. 1988;18:193–211. doi: 10.2307/271049
  • Hessen DJ. Loglinear representations of multivariate Bernoulli Rasch models. Br J Math Stat Psychol. 2011;64:337–354. doi: 10.1348/2044-8317.002000
  • Nelder JA, Wedderburn RWM. Generalized linear models. J R Stat Soc Ser A. 1972;135(3):370–384. doi: 10.2307/2344614
  • Green PJ. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. J R Stat Soc Ser B. 1984;46:149–170. doi: 10.1111/j.2517-6161.1984.tb01288.x
  • Bishop YM, Fienberg SE, Holland PW. Discrete multivariate analysis, theory and practice. New York: McGraw-Hill; 1975.
  • Good IJ. Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. Ann Math Stat. 1963;34(3):911–934. doi: 10.1214/aoms/1177704014
  • Lee AJ. Generating random binary deviates having fixed marginal distributions and specified degrees of association. Am Stat. 1993;47:209–215. doi: 10.1080/00031305.1993.10475980
  • Gange SJ. Generating multivariate categorical variates using the iterative proportional fitting algorithm. Am Stat. 1995;49(2):134–138. doi: 10.1080/00031305.1995.10476130
  • Choi HJ. A simple method for constructing multidimensional distributions of correlated categorical data. Commun Stat Simul Comput. 2008;37(7):1377–1384. doi: 10.1080/03610910802001863
  • Kateri M. Contingency table analysis. New York: Springer; 2014.
  • Geenens G. Copula modeling for discrete random vectors. Depend Model. 2020;8(1):417–440. doi: 10.1515/demo-2020-0022
  • Emrich LJ, Piedmonte MR. A method for generating high-dimensional multivariate binary variates. Am Stat. 1991;45(4):302–304. doi: 10.1080/00031305.1991.10475828
  • Park CG, Park T, Shin DW. A simple method for generating correlated binary variates. Am Stat. 1996;50(4):306–310. doi: 10.1080/00031305.1996.10473557
  • Lee AJ. Some simple methods for generating correlated categorical variates. Comput Stat Data Anal. 1997;26(2):133–148. doi: 10.1016/S0167-9473(97)00030-3
  • Lunn AD, Davies SJ. A note on generating correlated binary variables. Biometrika. 1998;85(2):487–490. doi: 10.1093/biomet/85.2.487
  • Al Osh MA, Lee SJ. A simple approach for generating correlated binary variates. J Stat Comput Simul. 2001;70(3):231–255. doi: 10.1080/00949650108812119
  • Demirtas H. A method for multivariate ordinal data generation given marginal distributions and correlations. J Stat Comput Simul. 2006;76(11):1017–1025. doi: 10.1080/10629360600569246
  • Jiang W, Song S, Hou L, et al. A set of efficient methods to generate high-dimensional binary data with specified correlation structures. Am Stat. 2021;75(3):310–322. doi: 10.1080/00031305.2020.1816213
  • Amatya A, Demirtas H. Multiord: an R package for generating correlated ordinal data. Commun Stat Simul Comput. 2015;44(7):1683–1691. doi: 10.1080/03610918.2013.824097
  • Touloumis A. Simulating correlated binary and multinomial responses under marginal model specification: the SimCorMultRes package. R J. 2016;8(2):79–91. doi: 10.32614/RJ-2016-034
  • Brown J, Abbott O, Diamond I. Dependence in the 2001 one-number census project. J R Stat Soc Ser A Stat Soc. 2006;169:883–902. doi: 10.1111/j.1467-985X.2006.00431.x
  • Baffour B, Brown JJ, Smith PWF. An investigation of triple system estimators in censuses. Stat J IAOS. 2013;29:53–68. doi: 10.3233/SJI-130760.
  • Gerritse SC, Bakker BFM, van der Heijden PGM. Different methods to complete datasets used for capture–recapture estimation: estimating the number of usual residents in the Netherlands. Stat J IAOS. 2015;31:613–627. doi: 10.3233/SJI-150938
  • Rasch G. Probabilistic models for some intelligence and attainment tests. The University of Chicago Press; Originally published 1960, Copenhagen: The Danish Institute of Educational Research, Chicago; 1980.
  • Lazarsfeld PF, Henry NW. Latent structure analysis. Boston: Houghton Mifflin; 1968.
  • Fienberg SE. The analysis of cross-classified categorical data. 2nd ed. Cambridge: Massachusetts Institute of Technology Press; 1980.
  • Wolter KM. Some coverage error models for census data. J Am Stat Assoc. 1986;81(394):337–346. doi: 10.1080/01621459.1986.10478277
  • Fienberg SE. The multiple recapture census for closed populations and incomplete 2k contingency tables. Biometrika. 1972;59:409–439. doi: 10.1093/biomet/59.3.591.
  • International Working Group for Disease Monitoring and Forecasting. Capture–recapture and multiple record systems estimation. Part I. History and theoretical development. Am J Epidemiol. 1995;142:1059–1068. doi: 10.1093/oxfordjournals.aje.a117559
  • Seber GAF. The estimation of animal abundance and related parameters. Caldwell (NJ): Blackburn Press; 1982.
  • Van der Heijden PGM, Whittaker J, Cruyff M, et al. People born in the Middle East but residing in the Netherlands: invariant population size estimates and the role of active and passive covariates. Ann Appl Stat. 2012;6:831–852. doi: 10.1214/12-AOAS536