Full article: Generating contingency tables with fixed marginal probabilities and dependence structures described by loglinear models

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

Abstract

We present a method to generate contingency tables that follow loglinear models with prescribed marginal probabilities and dependence structures. We make use of (loglinear) Poisson regression, where the dependence structures, described using odds ratios, are implemented using an offset term. Other statistical models related to loglinear models that fall into the scope of this paper, such as the logistic regression model, the latent class model and the extended Rasch are discussed as well. We apply this methodology to carry out simulation studies in the context of population size estimation using dual system and triple system estimators, popular in official statistics. These estimators use contingency tables that summarize the counts of elements enumerated or captured within lists that are linked. The simulation is used to investigate these estimators in the situation that the model assumptions are fulfilled, and the situation that the model assumptions are violated.

Keywords:

1. Introduction

In simulation studies that make use of contingency tables, samples are drawn from a population with properties that are well understood. In this paper a method is proposed for the generation of contingency tables with fixed marginal probabilities and prescribed dependence structures that are defined in terms of odds ratios. An odds ratio describes the strength of the association for a two-way contingency table. For multi-way tables conditional odds ratios play the same role for partial associations between variables [Citation1]. A simple example of such a population would be a two-way contingency table with marginal probabilities of 0.7 and 0.8 and the dependence represented by an odds ratio of 2.

Loglinear models have interaction parameters that can be directly understood in terms of odds ratios. Therefore the proposed method is built around the use of loglinear models. Models with two and three variables are explored, where each variable is dichotomous. However, the methodology presented can generate populations using loglinear models of any size, in terms of both the number of variables and the number of levels of the variables. Also, we can generate populations following logistic regression models having only categorical explanatory variables, and latent variable models that can be formulated as loglinear models, such as the latent class model [Citation2] and the extended Rasch model [Citation3].

There are a few ways to fit loglinear models. It suits our purposes to define loglinear models in terms of Poisson regression models with a log link [Citation4]. Such models can be fitted with, for example, iteratively reweighted least squares (IRLS) [Citation5]. The models we use require the marginal probabilities and odds ratios as input parameters. As a first step, the given marginal probabilities are used to generate the probabilities in a loglinear model for which complete independence holds [Citation6,Citation7, p. 345]. The prespecified interaction structure is imposed by using an offset, which is an additional explanatory variable with a prespecified parameter value that is fixed and not estimated.

When each margin of a contingency table has only two categories, the contingency table is a summary of the observations of correlated binary variates. Previously, Lee [Citation8] proposed a linear programming method for this problem. In comparison to his method, our method is simpler. Also, we present our method in the context of various applications of the loglinear model where conditional odds ratios can also be specified, and of latent variable models. The work of Gange [Citation9] is related to our work in the sense that he uses the iterative proportional fitting algorithm, an algorithm that was very popular in the early ages of loglinear modelling; yet he does not make a link to loglinear models and mentions odds rations only in passing. Choi [Citation10] also presented a method based on loglinear models, where the association is formulated in terms of the Pearson chi-square coefficient, and iterative proportional fitting is used in the generation of probabilities. Thus the way in which the association is specified by Choi is different from our method, that is more closely related to the way in which the loglinear model is defined. Similar to our work, Kateri [Citation11, section 2.2.5, p. 43] described that local odds ratios and row and column marginal distributions uniquely determine the table of joint probabilities, which also holds for any other minimal set of odds ratios. Our work further develops this and demonstrates a simple approach to generating data from this information. Geenens [Citation12] sets out a similar approach into a framework where the dependence is measured using copulas. He explains that the bivariate distributions are uniquely determined by a ‘margin-free’ association parameter (and that multiple different kinds of parameter are possible, but with one-to-one relations between them.

There is also much related but different work on generating correlated binary variates, see, for example, Emrich and Piedmonte [Citation13], Lee [Citation8], Gange [Citation9], Park et al. [Citation14], Lee [Citation15], Lunn and Davies [Citation16], Al Osh and Lee [Citation17], Demirtas [Citation18] and Jiang et al. [Citation19]. There is also some (though rather less) work on generating correlated multinomial or ordinal variables which can be used when the margins have more than two categories, for example Gange [Citation9], Choi [Citation10], Amatya and Demirtas [Citation20] and Touloumis [Citation21]. These approaches mostly focus on generating population probabilities in order to assess the performance of generalized estimating equation (GEE) models. In passing, some of these papers mention that their methodology can also be used with odds ratios instead of correlations, but they do not provide a detailed description of how this works. Geenens [Citation12] extends the connection of copulas and margin-free association measures to general discrete distributions. Combined with marginal information these generate distributions and Geenens demonstrates their existence and uniqueness. Our work is different in that we focus on generating data from loglinear models where the dependence is specified using odds ratios rather than correlations, and describe how this leads to a simple implementation using standard functions for fitting loglinear models.

The methodology we propose is illustrated with an application to population size estimation, as has been used in population censuses. This follows on from the work of Brown et al. [Citation22], Baffour et al. [Citation23] and Gerritse et al. [Citation24] who used a fixed odds ratio to investigate the impacts of dependence on population size estimation. In this application, lists where individuals within the target population are enumerated are used to estimate the population sizes. These lists are subject to incompleteness (undercoverage error), and multiple system estimation has been proposed to adjust for this undercoverage error. Some assumptions about the dependence of the lists must be made (though these may be quite weak where there are many lists), so it is important to test these methods under prespecified conditions. The proposed methodology and its application allow investigations into the impacts of given dependence structures within contingency tables to be carried out, varying the sizes of odds ratios, population sizes and response probabilities, and under different models.

The structure of the paper is as follows. In Section 2 we outline the methodology around log linear models and model fitting. From this we develop the proposed method for including prespecified interaction parameters in Section 2.4. In Section 3 we present an application where the proposed method is used to investigate the properties of population size estimation under dependence. A simulation study is presented for two-way and three-way contingency tables, alongside steps for users to implement the methodology. Section 4 makes an assessment of the advantages of our approach compared with existing methods for correlated binary or categorical data.

2. Methodology

2.1. Loglinear models

2.1.1. Models for two-way tables

The simplest situation that we consider is the loglinear model with two dichotomous variables A and B. Let A be indexed by i = 0, 1 and B by j = 0, 1. We denote the expected count for cell $(i, j)$ of the two-way contingency table of variables A and B by $μ_{ij}$ . We denote the loglinear parameters by λ. Then the saturated loglinear model for two lists is: (1) $\log μ_{ij} = λ + λ_{i}^{A} + λ_{j}^{B} + λ_{ij}^{AB},$ (1) where λ is the intercept term, $λ_{i}^{A}$ and $λ_{j}^{B}$ are the respective parameters for variables A and B, and $λ_{ij}^{AB}$ is the two factor interaction parameter. The parameters are identified by setting the parameter equal to 0 when an index is 0, i.e. $λ_{0}^{A} = λ_{0}^{B} = λ_{00}^{AB} = λ_{01}^{AB} = λ_{10}^{AB} = 0.$ Thus the model has four parameters and four observed counts, and hence is called saturated. The loglinear interaction parameter $λ_{11}^{AB}$ is closely related to odds ratio $θ^{AB}$ and the expected counts by (2) $θ^{AB} = exp (λ_{11}^{AB}) = \frac{(μ_{11} / μ_{01})}{(μ_{10} / μ_{00})} = \frac{μ_{11} μ_{00}}{μ_{01} μ_{10}} .$ (2) When $λ_{ij}^{AB} = 0$ , the odds ratio $θ^{AB} = 1$ , and the independence model holds.

The proposed methodology can be extended in a straightforward way to variables with more than two levels (compare [Citation11], section 2). Assume a $3 \times 3$ table with levels i, j = 1, 2, 3, with level 3 being the reference category. The model is described in Equation (Equation1(1) $\log μ_{ij} = λ + λ_{i}^{A} + λ_{j}^{B} + λ_{ij}^{AB},$ (1) ) with i, j now having different levels than in the $2 \times 2$ context. If we identify the 9 interaction parameters by setting these parameters equal to 0 when $(i, j = 3)$ , then there are four interaction parameters to be estimated, namely $λ_{11}^{AB}, λ_{12}^{AB}, λ_{21}^{AB}$ and $λ_{22}^{AB}$ . Thus there are four odds ratios, $θ_{11}^{AB}, θ_{12}^{AB}, θ_{21}^{AB}$ and $θ_{22}^{AB}$ , where for each cells in level $(i, j = 3)$ is involved. For example, the odds ratio (3) $θ_{12}^{AB} = \exp (λ_{12}^{AB}) = \frac{(μ_{12} / μ_{13})}{(μ_{32} / μ_{33})} = \frac{μ_{12} μ_{33}}{μ_{13} μ_{32}} .$ (3)

2.1.2. Models for three-way tables

Consider the loglinear model for three dichotomous variables, where the third variable is denoted as C. Here the saturated model is (4) $\log μ_{ijk} = λ + λ_{i}^{A} + λ_{j}^{B} + λ_{k}^{C} + λ_{ij}^{AB} + λ_{ik}^{AC} + λ_{jk}^{BC} + λ_{ijk}^{ABC},$ (4) where λ is the intercept term, $λ_{i}^{A}$ $λ_{j}^{B}$ and $λ_{j}^{C}$ are respectively the parameters for variables A, B and C, $λ_{ij}^{AB}$ , $λ_{ik}^{AC}$ and $λ_{jk}^{BC}$ are the two factor interaction parameters and $λ_{ijk}^{ABC}$ is the three factor interaction parameter. In order to identify the parameters, similar identifying restrictions hold as for the two-variable case.

When three dichotomous variables are present, the odds ratio for each pairwise relationship is conditional on the outcome of the variable not included in the pairwise relationship. E.g. when there is a relationship between variables A and B, the odds ratio is conditional on variable C: (5) $\begin{aligned} θ^{AB | C = 0} & = exp (λ_{11}^{AB}) = \frac{μ_{110} μ_{000}}{μ_{100} μ_{010}}, \end{aligned}$ (5) (6) $\begin{aligned} θ^{AB | C = 1} & = exp (λ_{11}^{AB} + λ_{111}^{ABC}) = \frac{μ_{111} μ_{001}}{μ_{101} μ_{011}} . \end{aligned}$ (6)

2.2. Latent variable models

Loglinear models are also used in the context of latent variable models. The Rasch model, popular in the field of item response theory in psychometrics and in the field of population size estimation, assumes a single continuous latent variable that explains the dependence between observed dichotomous variables, i.e. the observed variables are conditionally independent given the latent variable. A version of the popular Rasch model [Citation25, originally published in 1960] is the so-called extended Rasch model [Citation3], that can be written for three observed variables as model (Equation4(4) $\log μ_{ijk} = λ + λ_{i}^{A} + λ_{j}^{B} + λ_{k}^{C} + λ_{ij}^{AB} + λ_{ik}^{AC} + λ_{jk}^{BC} + λ_{ijk}^{ABC},$ (4) ) with all two-factor interactions identical, i.e. $λ_{11}^{AB} = λ_{11}^{AC} = λ_{11}^{BC}$ . In psychometric testing the variables A, B and C stand for item responses, and in the Rasch model individuals are ordered by the difficulty of the items and items are ordered by the aptitude of the individuals. For example, an individual may master item A but neither B nor the most difficult item C. Another individual masters A and B and thus has a higher aptitude than the first individual. Thus the model allows individuals to have heterogeneous probabilities to master items.

The latent class model assumes the existence of a categorical latent variable X with levels x, and given the level x the observed variables are independent. Let's assume that there are four observed variables A, B, C and D (having levels indexed by l) and a dichotomous variable X. Written in terms of (conditional) probabilities we have, (7) $π_{ijkl}^{ABCD} = Σ_{x} π_{ijklx}^{ABCDX} = Σ_{x} π_{x}^{X} π_{i | x}^{AX} π_{j | x}^{BX} π_{j | x}^{CX} π_{l | x}^{DX} .$ (7) This model was originally proposed by Lazersfeld and Henry [Citation26]. Haberman [Citation2] showed that the model can also be written as a loglinear model: (8) $\log μ_{ijklx}^{ABCDX} = λ + λ_{i}^{A} + λ_{j}^{B} + λ_{k}^{C} + λ_{l}^{D} + λ_{ix}^{AX} + λ_{jx}^{BX} + λ_{kx}^{CX} + λ_{lx}^{DX},$ (8) with $μ_{ijkl}^{ABCD} = Σ_{x} μ_{ijklx}^{ABCDX}$ . For the purpose of this paper, writing the latent class model as in Equation (Equation8(8) $\log μ_{ijklx}^{ABCDX} = λ + λ_{i}^{A} + λ_{j}^{B} + λ_{k}^{C} + λ_{l}^{D} + λ_{ix}^{AX} + λ_{jx}^{BX} + λ_{kx}^{CX} + λ_{lx}^{DX},$ (8) ) may not be that advantageous to generate a population, as Equation (Equation7(7) $π_{ijkl}^{ABCD} = Σ_{x} π_{ijklx}^{ABCDX} = Σ_{x} π_{x}^{X} π_{i | x}^{AX} π_{j | x}^{BX} π_{j | x}^{CX} π_{l | x}^{DX} .$ (7) ) can be used for this purpose in a straightforward way by specifying the (conditional) probabilities on the r.h.s. of Equation (Equation7(7) $π_{ijkl}^{ABCD} = Σ_{x} π_{ijklx}^{ABCDX} = Σ_{x} π_{x}^{X} π_{i | x}^{AX} π_{j | x}^{BX} π_{j | x}^{CX} π_{l | x}^{DX} .$ (7) ), thus generating population probabilities for the four-way table.

2.3. Logistic regression

Due to the relation between the loglinear model and the logistic regression model with categorical explanatory variables [Citation6, section 2.2.4], the method proposed in this paper can also be used for this subset of logistic regression models. We show the relation between the models for a simple example. Consider model (Equation4(4) $\log μ_{ijk} = λ + λ_{i}^{A} + λ_{j}^{B} + λ_{k}^{C} + λ_{ij}^{AB} + λ_{ik}^{AC} + λ_{jk}^{BC} + λ_{ijk}^{ABC},$ (4) ). Let the variable C with levels k = 0, 1 be the response variable and variables A and B be categorical variables. Then the logistic regression model is (9) $\log \frac{μ_{ij 1}}{μ_{ij 0}} = \log μ_{ij 1} - \log μ_{ij 0} = λ_{k}^{C} + λ_{ik}^{AC} + λ_{jk}^{BC} + λ_{ijk}^{ABC},$ (9) where the λ-parameters $λ_{k}^{C}, λ_{ik}^{AC}, λ_{jk}^{BC}$ and $λ_{ijk}^{ABC}$ can be replaced by β-parameters (not involving k and C) $β, β_{i}^{A}, β_{j}^{B}$ and $β_{ij}^{AB}$ respectively. The loglinear model $[AB] [C]$ corresponds with the intercept only model in logistic regression as $λ_{ik}^{AC} = λ_{jk}^{BC} = λ_{ijk}^{ABC} = 0$ implies $β_{i}^{A} = β_{j}^{B} = β_{ij}^{AB}$ = 0.

Multinomial and ordinal response models can also be dealt with using this approach if they have equivalent loglinear models. Examples include the baseline category-logit model and the continuation-ratio logit model, see Agresti [Citation1].

2.4. The proposed method

Loglinear models are usually estimated using the maximum likelihood method. An important property of maximum likelihood estimates of loglinear models is that, for the parameters fitted, the corresponding margins of the observed counts equal the margins of the fitted counts [Citation6, p. 69]. These fitted counts are the sufficient statistics. We denote maximum likelihood estimates by adding a hat to parameters and expected counts.

We focus here on the estimation of hierarchical loglinear models, starting with the two-way table and extending to three-way tables later on. In hierarchical loglinear models, when higher-order parameters are included, the lower order parameters are included as well, for example, if $λ_{ij}^{AB}$ is included, the lower order parameters $λ_{i}^{A}$ and $λ_{j}^{B}$ are included too. The use of hierarchical loglinear models allows for simple and easier interpretation of the model parameters.

2.4.1. Model fitting for two-way tables

For a two-way contingency table with observed counts $n_{ij}$ , when we fit model (Equation1(1) $\log μ_{ij} = λ + λ_{i}^{A} + λ_{j}^{B} + λ_{ij}^{AB},$ (1) ), the fitted values are equal to the observed counts, i.e. ${\hat{μ}}_{ij} = n_{ij}$ . This model is denoted by the variables constituting the highest fitted margins, i.e. [AB]. The margins $n_{ij}$ are the sufficient statistic for the saturated model. For the independence model, denoted by [A][B], the margins of variable A and variable B are fitted, i.e. ${\hat{μ}}_{ij} = {\hat{μ}}_{i +} {\hat{μ}}_{+ j} / {\hat{μ}}_{+ +} = n_{i +} n_{+ j} / n_{+ +}$ , where a ‘+’ denotes that the sum is taken over the corresponding index.

Although the two most relevant models, the saturated model and the independence model, can be fitted in a straightforward way, we fit the model here through loglinear Poisson regression, as this suits our purposes later on. Assume that the four elements $n_{ij}$ are counts derived from four Poisson distributions. Then a likelihood can be set up for $μ_{ij}$ and maximized over the parameters. If we collect the elements $μ_{ij}$ into a column vector, then model (Equation1(1) $\log μ_{ij} = λ + λ_{i}^{A} + λ_{j}^{B} + λ_{ij}^{AB},$ (1) ) can be written as, (10) $[\begin{matrix} \log (μ_{11}) \\ \log (μ_{10}) \\ \log (μ_{01}) \\ \log (μ_{00}) \end{matrix}] = [\begin{array}{cccc} 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ λ_{11}^{AB} \end{matrix}],$ (10) where the matrix with 0's and 1's is the design matrix formulating the loglinear model. Maximizing the likelihood gives maximum likelihood estimates ${\hat{μ}}_{ij}$ and parameter estimates $\hat{λ}, {\hat{λ}}_{1}^{A}, {\hat{λ}}_{1}^{B}$ and ${\hat{λ}}_{11}^{AB}$ . In fitting the independence model, the last column of the design matrix, $(1, 0, 0, 0)^{T}$ and the parameter $λ_{11}^{AB}$ are left out of the equation. In software packages this can be accomplished in routines for Poisson regression, where the four observed counts $n_{ij}$ are provided as the Y-variable, the model is represented by the design matrix in Equation (Equation10(10) $[\begin{matrix} \log (μ_{11}) \\ \log (μ_{10}) \\ \log (μ_{01}) \\ \log (μ_{00}) \end{matrix}] = [\begin{array}{cccc} 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ λ_{11}^{AB} \end{matrix}],$ (10) ), and the output consists of four fitted values ${\hat{μ}}_{ij}$ and the parameter estimates.

2.4.2. Generating two-way tables with given properties

We now develop a model for the two-way contingency table with prespecified marginal probabilities and a prespecified interaction parameter. For the interaction parameter we want to make use of the odds ratio $θ^{AB}$ defined in Equation (Equation2(2) $θ^{AB} = exp (λ_{11}^{AB}) = \frac{(μ_{11} / μ_{01})}{(μ_{10} / μ_{00})} = \frac{μ_{11} μ_{00}}{μ_{01} μ_{10}} .$ (2) ). This can be accomplished by using an offset in Poisson regression. An offset is a column in the design matrix for which no parameter is estimated, i.e. the parameter is fixed to 1. The odds ratio is specified to give the desired level of dependence between lists. If we denote the prespecified odds ratio by ${\tilde{θ}}^{AB}$ , then the desired loglinear interaction parameter is ${\tilde{λ}}_{11}^{AB} = \log {\tilde{θ}}^{AB}$ . Thus the population that we want to generate has to follow the following model: (11) $\log μ_{ij} = λ + λ_{i}^{A} + λ_{j}^{B} + {\tilde{λ}}_{ij}^{AB} .$ (11) The following equation, an adjusted version of Equation (Equation10(10) $[\begin{matrix} \log (μ_{11}) \\ \log (μ_{10}) \\ \log (μ_{01}) \\ \log (μ_{00}) \end{matrix}] = [\begin{array}{cccc} 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ λ_{11}^{AB} \end{matrix}],$ (10) ), illustrates how this is implemented in Poisson regression: (12) $[\begin{matrix} \log (μ_{11}) \\ \log (μ_{10}) \\ \log (μ_{01}) \\ \log (μ_{00}) \end{matrix}] = [\begin{array}{cccc} 1 & 1 & 1 & \log {\tilde{θ}}^{AB} \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ 1 \end{matrix}] .$ (12) Notice that the parameter $λ_{11}^{AB}$ from (Equation10(10) $[\begin{matrix} \log (μ_{11}) \\ \log (μ_{10}) \\ \log (μ_{01}) \\ \log (μ_{00}) \end{matrix}] = [\begin{array}{cccc} 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ λ_{11}^{AB} \end{matrix}],$ (10) ) is set to 1, i.e. it is not estimated, and this value 1 is multiplied with the prespecified $\log {\tilde{θ}}^{AB}$ in the design matrix. In software packages for Poisson regression this can be accomplished as follows:

specify an offset vector which is $\log {\tilde{θ}}^{AB}$ when $(i, j) = (1, 1)$ and 0 otherwise.
use an independence model as the design matrix, as we only need estimates for $λ, λ_{i}^{A}$ and $λ_{j}^{B}$ .
in order to fit a model that leads to estimated counts that follow the prespecified marginal probabilities, calculate joint probabilities for which these marginal probabilities hold, and use these joint probabilities, multiplied with some constant, as the counts $n_{ij}$ to be analyzed with model (Equation11(11) $\log μ_{ij} = λ + λ_{i}^{A} + λ_{j}^{B} + {\tilde{λ}}_{ij}^{AB} .$ (11) ).
then the estimates of expected frequencies can be used to derive the joint probabilities with the prespecified marginal probabilities and the prespecified odds ratio.

As an example, assume we want to generate population probabilities that have the prespecified marginal probabilities 0.7 and 0.8 and the prespecified odds ratio 2. Then the probabilities to be analyzed are 0.7 * 0.8 = 0.56, 0.7 * 0.2 = 0.14, 0.3 * 0.8 = 0.24 and 0.3 * 0.2 = 0.06, and multiply these probabilities with some number, such as 1000, to get counts. Thus 560, 140, 240 and 60 are the values of the dependent variable in Poisson regression. Specify $\log (2)$ as the offset and analyze the counts with the loglinear independence model and the offset. This yields the fitted values 584.8, 115.2, 215.2, 84.8 (rounded to 1 decimal place), and by dividing them by 1000 we have the four probabilities, with the prespecified marginal probabilities 0.7 and 0.8 and the prespecified odds ratio 2. The R-code we used can be found in the online Supplementary Material.

We now prove that for model (Equation11(11) $\log μ_{ij} = λ + λ_{i}^{A} + λ_{j}^{B} + {\tilde{λ}}_{ij}^{AB} .$ (11) ) the prespecified margins of the observed table are equal to those in the fitted table, irrespective of the offset. We follow Fienberg [Citation27, Appendix II]. If we assume that the counts are observations from independent Poisson distributions, then the likelihood function is proportional to (13) $\prod_{ij} μ_{ij}^{n_{ij}} \exp (- μ_{ij}),$ (13) and the kernel of the loglikelihood is, (14) $\sum_{ij} n_{ij} \log μ_{ij} - \sum_{ij} μ_{ij} = \sum_{ij} n_{ij} (λ + λ_{i}^{A} + λ_{j}^{B} + {\tilde{λ}}_{ij}^{AB}) - \sum_{ij} μ_{ij} .$ (14) At the maximum of the likelihood the derivative of the loglikelihood over the parameters is 0. So, for example, taking the derivative over $λ_{i}^{A}$ gives $n_{i +} - {\hat{μ}}_{i +} = 0$ and this gives $n_{i +} = {\hat{μ}}_{i +}$ . Similarly for $n_{+ j} = {\hat{μ}}_{+ j}$ . This completes the proof. It follows that including an offset for the interaction does not affect the property of ordinary loglinear models that for the fitted parameters the corresponding margins of the fitted values are equal to those of the observed values.

The model for the $3 \times 3$ table in Section 2.1.1 has the following implementation: (15) $[\begin{matrix} \log (μ_{11}) \\ \log (μ_{12}) \\ \log (μ_{13}) \\ \log (μ_{21}) \\ \log (μ_{22}) \\ \log (μ_{23}) \\ \log (μ_{31}) \\ \log (μ_{32}) \\ \log (μ_{33}) \end{matrix}] = [\begin{array}{cccccc} 1 & 1 & 0 & 1 & 0 & \log {\tilde{θ}}_{11}^{AB} \\ 1 & 1 & 0 & 0 & 1 & \log {\tilde{θ}}_{12}^{AB} \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & \log {\tilde{θ}}_{21}^{AB} \\ 1 & 0 & 1 & 0 & 1 & \log {\tilde{θ}}_{22}^{AB} \\ 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{2}^{A} \\ λ_{1}^{B} \\ λ_{2}^{B} \\ 1 \end{matrix}] .$ (15)

2.4.3. Model fitting for three-way tables

For the saturated model with three variables, the fitted values are equal to the observed counts so ${\hat{μ}}_{ijk} = n_{ijk}$ . As an example of a restricted model, the estimates for the [AB][AC] model can be calculated directly as (16) ${\hat{μ}}_{ijk} = \frac{{\hat{μ}}_{ij +} {\hat{μ}}_{i + k}}{{\hat{μ}}_{i + +}} = \frac{n_{ij +} n_{i + k}}{n_{i + +}} .$ (16) However, for the loglinear model [AB][AC][BC] estimates must be calculated indirectly, iteratively (for example, [Citation6], p. 84).

We now develop (Equation4(4) $\log μ_{ijk} = λ + λ_{i}^{A} + λ_{j}^{B} + λ_{k}^{C} + λ_{ij}^{AB} + λ_{ik}^{AC} + λ_{jk}^{BC} + λ_{ijk}^{ABC},$ (4) ) in the form of Poisson regression. There are eight elements $n_{ijk}$ , being counts from eight Poisson distributions. The likelihood can be set up for $μ_{ijk}$ and maximised over the parameters. Model (Equation4(4) $\log μ_{ijk} = λ + λ_{i}^{A} + λ_{j}^{B} + λ_{k}^{C} + λ_{ij}^{AB} + λ_{ik}^{AC} + λ_{jk}^{BC} + λ_{ijk}^{ABC},$ (4) ) can be denoted as, (17) $[\begin{matrix} \log (μ_{111}) \\ \log (μ_{110}) \\ \log (μ_{101}) \\ \log (μ_{100}) \\ \log (μ_{011}) \\ \log (μ_{010}) \\ \log (μ_{001}) \\ \log (μ_{000}) \end{matrix}] = [\begin{array}{cccccccc} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ λ_{1}^{C} \\ λ_{11}^{AB} \\ λ_{11}^{AC} \\ λ_{11}^{BC} \\ λ_{111}^{ABC} \end{matrix}] .$ (17) As before, for (Equation17(17) $[\begin{matrix} \log (μ_{111}) \\ \log (μ_{110}) \\ \log (μ_{101}) \\ \log (μ_{100}) \\ \log (μ_{011}) \\ \log (μ_{010}) \\ \log (μ_{001}) \\ \log (μ_{000}) \end{matrix}] = [\begin{array}{cccccccc} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ λ_{1}^{C} \\ λ_{11}^{AB} \\ λ_{11}^{AC} \\ λ_{11}^{BC} \\ λ_{111}^{ABC} \end{matrix}] .$ (17) ) the matrix with 0's and 1's is the design matrix formulating the loglinear model. A restricted model can be fitted by dropping columns in Equation (Equation17(17) $[\begin{matrix} \log (μ_{111}) \\ \log (μ_{110}) \\ \log (μ_{101}) \\ \log (μ_{100}) \\ \log (μ_{011}) \\ \log (μ_{010}) \\ \log (μ_{001}) \\ \log (μ_{000}) \end{matrix}] = [\begin{array}{cccccccc} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ λ_{1}^{C} \\ λ_{11}^{AB} \\ λ_{11}^{AC} \\ λ_{11}^{BC} \\ λ_{111}^{ABC} \end{matrix}] .$ (17) ). For example, in fitting the model where B and C are independent given A, the last two columns of the design matrix and the last two parameters are left out of the equation.

2.4.4. Generating three-way tables with given properties

We now focus on loglinear models with prescribed odds ratios. We start with the model with no three-factor interaction. The model for a three-way contingency table with three prespecified odds ratios ${\tilde{θ}}^{AB}, {\tilde{θ}}^{AC}$ and ${\tilde{θ}}^{BC}$ is denoted as, (18) $\log μ_{ij} = λ + λ_{i}^{A} + λ_{j}^{B} + λ_{k}^{C} + \log {\tilde{θ}}^{AB} + \log {\tilde{θ}}^{AC} + \log {\tilde{θ}}^{BC} .$ (18) In the adjusted version of (Equation17(17) $[\begin{matrix} \log (μ_{111}) \\ \log (μ_{110}) \\ \log (μ_{101}) \\ \log (μ_{100}) \\ \log (μ_{011}) \\ \log (μ_{010}) \\ \log (μ_{001}) \\ \log (μ_{000}) \end{matrix}] = [\begin{array}{cccccccc} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ λ_{1}^{C} \\ λ_{11}^{AB} \\ λ_{11}^{AC} \\ λ_{11}^{BC} \\ λ_{111}^{ABC} \end{matrix}] .$ (17) ) this becomes (using that $\log {\tilde{θ}}^{AB} + \log {\tilde{θ}}^{AC} + \log {\tilde{θ}}^{BC} = \log {\tilde{θ}}^{AB} {\tilde{θ}}^{AC} {\tilde{θ}}^{BC}$ ): (19) $[\begin{matrix} \log (μ_{111}) \\ \log (μ_{110}) \\ \log (μ_{101}) \\ \log (μ_{100}) \\ \log (μ_{011}) \\ \log (μ_{010}) \\ \log (μ_{001}) \\ \log (μ_{000}) \end{matrix}] = [\begin{array}{ccccc} 1 & 1 & 1 & 1 & \log {\tilde{θ}}^{AB} {\tilde{θ}}^{AC} {\tilde{θ}}^{BC} \\ 1 & 1 & 1 & 0 & \log {\tilde{θ}}^{AB} \\ 1 & 1 & 0 & 1 & \log {\tilde{θ}}^{AC} \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & \log {\tilde{θ}}^{BC} \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ λ_{1}^{C} \\ 1 \end{matrix}],$ (19) where the parameter for the offsets has been set to 1. In software packages for Poisson regression this can be accomplished as follows:

specify $\log {\tilde{θ}}^{AB}$ , $\log {\tilde{θ}}^{AC}$ and $\log {\tilde{θ}}^{BC}$ , and use them to generate the offset column of the design matrix as in (Equation19(19) $[\begin{matrix} \log (μ_{111}) \\ \log (μ_{110}) \\ \log (μ_{101}) \\ \log (μ_{100}) \\ \log (μ_{011}) \\ \log (μ_{010}) \\ \log (μ_{001}) \\ \log (μ_{000}) \end{matrix}] = [\begin{array}{ccccc} 1 & 1 & 1 & 1 & \log {\tilde{θ}}^{AB} {\tilde{θ}}^{AC} {\tilde{θ}}^{BC} \\ 1 & 1 & 1 & 0 & \log {\tilde{θ}}^{AB} \\ 1 & 1 & 0 & 1 & \log {\tilde{θ}}^{AC} \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & \log {\tilde{θ}}^{BC} \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ λ_{1}^{C} \\ 1 \end{matrix}],$ (19) ).
use an independence model in the design matrix, as we only need estimates for $λ, λ_{i}^{A}$ , $λ_{j}^{B}$ and $λ_{k}^{C}$ .
in order to fit a model that leads to estimated counts that follow prespecified marginal probabilities, calculate joint probabilities for which these marginal probabilities hold by multiplying the relevant probabilities. Use these joint probabilities, multiplied by some constant representing a notional population size, as the observed counts corresponding with $n_{ijk}$ to be analyzed with model (Equation19(19) $[\begin{matrix} \log (μ_{111}) \\ \log (μ_{110}) \\ \log (μ_{101}) \\ \log (μ_{100}) \\ \log (μ_{011}) \\ \log (μ_{010}) \\ \log (μ_{001}) \\ \log (μ_{000}) \end{matrix}] = [\begin{array}{ccccc} 1 & 1 & 1 & 1 & \log {\tilde{θ}}^{AB} {\tilde{θ}}^{AC} {\tilde{θ}}^{BC} \\ 1 & 1 & 1 & 0 & \log {\tilde{θ}}^{AB} \\ 1 & 1 & 0 & 1 & \log {\tilde{θ}}^{AC} \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & \log {\tilde{θ}}^{BC} \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ λ_{1}^{C} \\ 1 \end{matrix}],$ (19) ).
then, as before in Section 2.4.1, the estimates of the expected frequencies can be used to derive the joint probabilities with the prespecified marginal probabilities and the prespecified odds ratios.

As an example, assume we want to generate population probabilities that have the prespecified marginal probabilities 0.8, 0.7 and 0.9 for variables A, B and C respectively, and with all the prespecified odds ratios $= 2$ . The probabilities following the independence model are 0.7 * 0.8 * 0.9 = 0.504, 0.7 * 0.8 * 0.1 = 0.056, 0.7 * 0.2 * 0.9 = 0.216, 0.7 * 0.2 * 0.1 = 0.024, 0.3 * 0.8 * 0.9 = 0.126, 0.3 * 0.8 * 0.1 = 0.014, 0.3 * 0.2 * 0.9 = 0.054 and 0.3 * 0.2 * 0.1 = 0.006 and we multiply these probabilities by 1,000 to get counts 504, 56, 216, 24, 126, 14, 54 and 6 which form the values of the dependent variable in Poisson regression. Specify $\log (2)$ as the three offsets and analyze the counts with the loglinear independence model and the offset as in Equation (Equation19(19) $[\begin{matrix} \log (μ_{111}) \\ \log (μ_{110}) \\ \log (μ_{101}) \\ \log (μ_{100}) \\ \log (μ_{011}) \\ \log (μ_{010}) \\ \log (μ_{001}) \\ \log (μ_{000}) \end{matrix}] = [\begin{array}{ccccc} 1 & 1 & 1 & 1 & \log {\tilde{θ}}^{AB} {\tilde{θ}}^{AC} {\tilde{θ}}^{BC} \\ 1 & 1 & 1 & 0 & \log {\tilde{θ}}^{AB} \\ 1 & 1 & 0 & 1 & \log {\tilde{θ}}^{AC} \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & \log {\tilde{θ}}^{BC} \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ λ_{1}^{C} \\ 1 \end{matrix}],$ (19) ). This yields the fitted values (rounded to 1 decimal place) 547.3, 39.5, 186.3, 26.9, 99.0, 14.3, 67.4 and 19.4. By dividing these by 1000 we have the eight probabilities with the prespecified marginal probabilities 0.7, 0.8 and 0.9 and the three prespecified odds ratios 2. For example, the marginal probability of being in variable A equals the sum of the fitted values for A = 1 divided by 1000, which is 0.8 and the conditional odds $θ^{BC | A = 1} = (547.3 * 26.9) / (39.5 * 186.3) = 2$ . The R-code we used can be found in the online Supplementary Material.

We now discuss the model with a prespecified three-factor interaction. Above we discussed model fitting for three-way tables, with no three-factor interaction. Here we develop the model to include a prespecified three-factor interaction parameter in a loglinear model. The three-factor interaction is defined as the state where two conditional odds ratios are not equal, for example, ${\tilde{θ}}^{AB | C = 0} \neq {\tilde{θ}}^{AB | C = 1}$ . These conditional odds ratios are related to loglinear parameters by ${\tilde{θ}}^{AB | C = 0} = \exp {\tilde{λ}}_{11}^{AB}$ and ${\tilde{θ}}^{AB | C = 1} = \exp [{\tilde{λ}}_{11}^{AB} + {\tilde{λ}}_{111}^{ABC}]$ . As similar results hold for ${\tilde{θ}}^{AC | B = 0}$ , ${\tilde{θ}}^{AC | B = 1}$ , ${\tilde{θ}}^{BC | A = 0}$ and ${\tilde{θ}}^{BC | A = 1}$ this shows that the three-way interaction is represented by $\exp {\tilde{λ}}_{111}^{ABC}$ . A three-way interaction will be handled as follows. In Equation (Equation19(19) $[\begin{matrix} \log (μ_{111}) \\ \log (μ_{110}) \\ \log (μ_{101}) \\ \log (μ_{100}) \\ \log (μ_{011}) \\ \log (μ_{010}) \\ \log (μ_{001}) \\ \log (μ_{000}) \end{matrix}] = [\begin{array}{ccccc} 1 & 1 & 1 & 1 & \log {\tilde{θ}}^{AB} {\tilde{θ}}^{AC} {\tilde{θ}}^{BC} \\ 1 & 1 & 1 & 0 & \log {\tilde{θ}}^{AB} \\ 1 & 1 & 0 & 1 & \log {\tilde{θ}}^{AC} \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & \log {\tilde{θ}}^{BC} \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ λ_{1}^{C} \\ 1 \end{matrix}],$ (19) ) we fill in $\log {\tilde{θ}}^{AB | C = 0}$ in cell (2,5) of the design matrix and $\log {\tilde{θ}}^{AB | C = 1}$ in cell (1,5) of the design matrix, and similarly for the other conditional odds ratios. Thus, the model with prespecified conditional odds ratios can be presented as, (20) $[\begin{matrix} \log (μ_{111}) \\ \log (μ_{110}) \\ \log (μ_{101}) \\ \log (μ_{100}) \\ \log (μ_{011}) \\ \log (μ_{010}) \\ \log (μ_{001}) \\ \log (μ_{000}) \end{matrix}] = [\begin{array}{ccccc} 1 & 1 & 1 & 1 & \log {\tilde{θ}}^{AB | C = 1} {\tilde{θ}}^{AC | B = 1} {\tilde{θ}}^{BC | A = 1} \\ 1 & 1 & 1 & 0 & \log {\tilde{θ}}^{AB | C = 0} \\ 1 & 1 & 0 & 1 & \log {\tilde{θ}}^{AC | B = 0} \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & \log {\tilde{θ}}^{BC | A = 0} \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ λ_{1}^{C} \\ 1 \end{matrix}] .$ (20) We can trivially write (21) ${\tilde{θ}}^{AB | C = 1} = {\tilde{θ}}^{AB | C = 0} \frac{{\tilde{θ}}^{AB | C = 1}}{{\tilde{θ}}^{AB | C = 0}},$ (21) the term $[{\tilde{θ}}^{AB | C = 1} / {\tilde{θ}}^{AB | C = 0}]$ , which is the difference in odds for C = 1 and C = 0, is equal to $\exp {\tilde{λ}}_{111}^{ABC}$ . The terms ${\tilde{θ}}^{AC | B = 1}$ and ${\tilde{θ}}^{BC | A = 1}$ include this same three-way interaction term, i.e. (22) $\begin{aligned} {\tilde{θ}}^{AC | B = 1} & = {\tilde{θ}}^{AC | B = 0} \frac{{\tilde{θ}}^{AB | C = 1}}{{\tilde{θ}}^{AB | C = 0}}, \end{aligned}$ (22) (23) $\begin{aligned} {\tilde{θ}}^{BC | A = 1} & = {\tilde{θ}}^{BC | A = 0} \frac{{\tilde{θ}}^{AB | C = 1}}{{\tilde{θ}}^{AB | C = 0}} . \end{aligned}$ (23) In fact since the three-factor interaction parameter ${\tilde{λ}}_{111}^{ABC}$ is the difference between each pair of conditional odds ratios, it is clear that (24) $\frac{θ^{AB | C = 1}}{θ^{AB | C = 0}} = \frac{θ^{AC | B = 1}}{θ^{AC | B = 0}} = \frac{θ^{BC | A = 1}}{θ^{BC | A = 0}} .$ (24) If we specify the denominators of each fraction, then as soon as one numerator is specified then all the fractions are determined. There are therefore only four free odds ratios in Equation (Equation24(24) $\frac{θ^{AB | C = 1}}{θ^{AB | C = 0}} = \frac{θ^{AC | B = 1}}{θ^{AC | B = 0}} = \frac{θ^{BC | A = 1}}{θ^{BC | A = 0}} .$ (24) ), and the three-factor interaction can be specified by providing all the denominators and any one of the numerators. Therefore we can rewrite element (1,5) of the design matrix in (Equation20(20) $[\begin{matrix} \log (μ_{111}) \\ \log (μ_{110}) \\ \log (μ_{101}) \\ \log (μ_{100}) \\ \log (μ_{011}) \\ \log (μ_{010}) \\ \log (μ_{001}) \\ \log (μ_{000}) \end{matrix}] = [\begin{array}{ccccc} 1 & 1 & 1 & 1 & \log {\tilde{θ}}^{AB | C = 1} {\tilde{θ}}^{AC | B = 1} {\tilde{θ}}^{BC | A = 1} \\ 1 & 1 & 1 & 0 & \log {\tilde{θ}}^{AB | C = 0} \\ 1 & 1 & 0 & 1 & \log {\tilde{θ}}^{AC | B = 0} \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & \log {\tilde{θ}}^{BC | A = 0} \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ λ_{1}^{C} \\ 1 \end{matrix}] .$ (20) ) as (25) $\log {\tilde{θ}}^{AB | C = 1} {\tilde{θ}}^{AC | B = 1} {\tilde{θ}}^{BC | A = 1} = \log {\tilde{θ}}^{AB | C = 0} {\tilde{θ}}^{AC | B = 0} {\tilde{θ}}^{BC | A = 0} {[\frac{{\tilde{θ}}^{AB | C = 1}}{{\tilde{θ}}^{AB | C = 0}}]}^{3},$ (25) using only four odds ratios.

As an example, assume that we want to create population probabilities with conditional odds ratios ${\tilde{θ}}^{AB | C = 0} = 2$ and ${\tilde{θ}}^{AB | C = 1} = 3$ . It follows that the three factor term is $[{\tilde{θ}}^{AB | C = 1} / {\tilde{θ}}^{AB | C = 0}] = 3 / 2$ . Then, let ${\tilde{θ}}^{AC | B = 1} = 1$ , and it follows that ${\tilde{θ}}^{AC | B = 0} [{\tilde{θ}}^{AB | C = 1} / {\tilde{θ}}^{AB | C = 0}] = 1$ , so that ${\tilde{θ}}^{AC | B = 0} = 2 / 3$ . Similarly, if ${\tilde{θ}}^{BC | A = 1} = 1$ , then ${\tilde{θ}}^{BC | A = 0} = 2 / 3$ . Thus the model equation becomes (26) $[\begin{matrix} \log (μ_{111}) \\ \log (μ_{110}) \\ \log (μ_{101}) \\ \log (μ_{100}) \\ \log (μ_{011}) \\ \log (μ_{010}) \\ \log (μ_{001}) \\ \log (μ_{000}) \end{matrix}] = [\begin{array}{ccccc} 1 & 1 & 1 & 1 & \log 3 \\ 1 & 1 & 1 & 0 & \log 2 \\ 1 & 1 & 0 & 1 & \log (2 / 3) \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & \log (2 / 3) \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{array}] [\begin{matrix} λ \\ λ_{1}^{A} \\ λ_{1}^{B} \\ λ_{1}^{C} \\ 1 \end{matrix}] .$ (26)

3. Application to population size estimation

3.1. Dual system estimation

Nowadays dual system estimation methods are commonly used to estimate population totals for domains of interest. The dual system estimator is the simplest capture-recapture model and is used to estimate the part of the population that is missing after having linked individuals in two overlapping lists. Originally this method was implemented to estimate the size and density of wildlife populations, but it has also been developed and used in official statistics, where the dual-system estimator was originally used for the 1950 decennial United States Census to estimate coverage error [Citation28].

The linked data can be presented in a two-way contingency table: we observe the counts of those in both lists, those in list A only, and those in list B only, but not those missing from both lists. Thus there are three observed cell counts, and the missing count for those in neither of the lists is to be estimated, which can conveniently be done using the loglinear model framework [Citation29].

Dual system estimation relies on five key assumptions, outlined by the International Working Group for Disease Monitoring and Forecasting [Citation30]:

The population is closed, i.e. there is no change such as births or deaths.
There is perfect matching between the two lists.
The capture/inclusion probability of elements in at least one of the lists is homogeneous [Citation31,Citation32].
The probabilities of inclusion in the lists are independent.
There are no erroneous captures in any list (no overcoverage).

One of the key assumptions is that the probabilities of inclusion in the lists are independent, which is easily violated [Citation24]. However, with the data used for population size estimation, this assumption is impossible to verify and so the design of both lists is very important, to create the conditions where this assumption is likely to hold. A way to reduce the impact of violating this strict assumption is to divide the population by certain characteristics, such as age, sex and some geographical breakdown, so that we need only assume that the lists are independent conditional on these classifiers.

Assumption 4 follows on from assumption 3, which implies that individuals who where included in the first list and those who were not have the same probability of being included in the second list, so that presence in the first list does not affect response in the second list, and the lists are therefore statistically independent [Citation30]. Therefore, as there are three observed counts, to estimate cell counts ${\hat{μ}}_{ij}$ for a two-way contingency table only three parameters can be used and no interaction term can be modelled, and (Equation1(1) $\log μ_{ij} = λ + λ_{i}^{A} + λ_{j}^{B} + λ_{ij}^{AB},$ (1) ) is adjusted to include no interaction term, i.e. $λ_{ij}^{AB} = 0$ . Independence implies that the odds ratio (Equation2(2) $θ^{AB} = exp (λ_{11}^{AB}) = \frac{(μ_{11} / μ_{01})}{(μ_{10} / μ_{00})} = \frac{μ_{11} μ_{00}}{μ_{01} μ_{10}} .$ (2) ) is 1, so that the missing count is estimated by ${\hat{μ}}_{00} = {\hat{μ}}_{10} {\hat{μ}}_{01} / {\hat{μ}}_{11} = n_{10} n_{01} / n_{11}$ .

3.2. Triple system estimation

The triple system estimator uses three linked lists. Counts can be summarized in a three-way contingency table where one count is missing, namely the count for the individuals that are missed by all three lists. By estimating this count and adding it to the seven observed counts, the population size is estimated. Unlike a two-way contingency table, in a three-way contingency table the pairwise interactions between lists can be modelled. This means that the homogeneous capture/inclusion probabilities and the independence assumption can be relaxed, while the other assumptions outlined for dual system estimation still hold.

As there are seven cells with observed counts, only loglinear models with up to seven independent parameters can be fitted. This is usually solved by assuming the three-factor interaction term ${\hat{λ}}_{ijk}^{ABC} = 0$ . Including two-factor interaction terms in the model enables us to estimate population totals where pairwise relationships are present [Citation6]. For this paper, we are interested in fitting the maximal model for a three-way contingency table with one unobserved cell, which is an adjusted version of (Equation4(4) $\log μ_{ijk} = λ + λ_{i}^{A} + λ_{j}^{B} + λ_{k}^{C} + λ_{ij}^{AB} + λ_{ik}^{AC} + λ_{jk}^{BC} + λ_{ijk}^{ABC},$ (4) ) with no three factor interaction term.

3.3. How to apply the proposed methodology to population size estimation

The following outlines how to generate contingency tables with prespecified margins and interactions between lists. Such generated contingency tables can be used in simulations to study the behaviour of estimators for the population size, as the true population size is known.

To generate a contingency table where lists are independent, the total population size and inclusion probabilities for each list are prespecified. The cell counts are then computed by multiplying the marginal inclusion probabilities and the population size. This results in cell counts that follow statistical independence between lists.

To generate a table with a given dependence structure, an independent loglinear model is then fitted to these cell counts with the inclusion of an offset term, as presented in (Equation11(11) $\log μ_{ij} = λ + λ_{i}^{A} + λ_{j}^{B} + {\tilde{λ}}_{ij}^{AB} .$ (11) ) and (Equation18(18) $\log μ_{ij} = λ + λ_{i}^{A} + λ_{j}^{B} + λ_{k}^{C} + \log {\tilde{θ}}^{AB} + \log {\tilde{θ}}^{AC} + \log {\tilde{θ}}^{BC} .$ (18) ). Then the fitted counts have the following properties:

Their sum is equal to the prespecified population size,
The prespecified odds ratios hold, and
The prespecified marginal response probabilities hold.

In order to investigate the behaviour of population size estimators, multinomial samples are drawn from this population. For the study of the dual system estimator, in each sample the cell (0,0) is set to missing and the estimator is applied. For the study of the triple systems estimator, in each sample the cell (0,0,0) is set to missing and the estimator is applied. When t samples are drawn, t population size estimates are obtained that can be compared with the prespecified population size.

3.4. Simulation study

The aim of the simulation study is to illustrate the methodology. Probabilities are generated using prespecified marginal probabilities and odds ratios. 2000 multinomial samples are drawn with these probabilities and with the total population size set equal to 1000 or 2000. The prespecified probabilities and odds ratios mirror those used by Brown et al. [Citation22] in their investigation of methods to deal with dependence in the coverage estimation for the 2001 population census of England and Wales. The outcomes of the simulations are presented in Tables and . The results include the specified values for the total population, the marginal response probabilities, the odds ratio (OR) between the lists, the mean and median of the population size estimates, empirical 95% confidence intervals (which is the range between the 2.5% and 97.5% quantiles of the estimates), the relative bias %, $(\hat{μ} - N) / N$ and the Coefficient of Variation $\hat{σ}$ /N, where N is the prespecified population size, $\hat{μ}$ is the mean and $\hat{σ}$ is the standard deviation of the population size estimates over the simulations.

Table 1. Simulation results for dual systems estimation with inclusion probabilities $π_{A}$ and $π_{B}$ , sizes of odds ratios (OR), the mean and median of the estimates, the 95% confidence interval of the mean, the relative bias and coefficient of variation (CV), where the fitted model is [A][B].

Display Table

Table 2. Simulation results for triple systems estimation with inclusion probabilities $π_{A}$ = 0.8, $π_{B}$ = 0.7 and $π_{C}$ = 0.9, with different sizes of odds ratios (OR), specified models, the mean and median of the estimates, the 95% confidence interval of the mean, the relative bias and coefficient of variation (CV).

Display Table

3.4.1. Two-way table

We set the response probabilities to either, for list A ( $π_{A}$ ) 0.8 and for list B ( $π_{B}$ ) 0.7; or for list A ( $π_{A}$ ) 0.9 and for list B ( $π_{B}$ ) 0.8. The joint probabilities ( $π_{11}, π_{10}, π_{01}, π_{00}$ ) are, respectively, (0.56, 0.24, 0.14, 0.06) and (0.72, 0.18, 0.08 and 0.02). Cell counts are obtained by $μ_{ij} = (N π_{ij})$ .

The results presented in Table from these simulations show, that when the lists are independent (OR = 1), and the independence model is fitted, the mean of the estimated population size across 2000 multinomial samples is very close to the true population size. However, the bias appears to depend on the size of the population and the response probabilities for each list. When the total population is smaller, N = 1000, and the response probabilities are lower, $π_{A}$ = 0.7 and $π_{B}$ = 0.8, for the lists, the mean of the estimated population sizes is slightly larger than the true population size. However, the bias is negligible when N = 1000 and $π_{A}$ = 0.8 and $π_{B}$ = 0.9, as the mean estimated population total across all multinomial samples is approximately equal to the true population total. Therefore, the higher the response probabilities of the lists, the less biased the estimates of the population size will be.

When the lists are dependent, and the independence model is fitted, the mean estimated population size is smaller than the true population size. This shows that positive odds ratios result in population size estimates being negatively biased (compare with [Citation24]). In some cases, when the odds ratio = 1.1, the true population size falls within the confidence interval. This is an issue of power, as the misspecification of the model is small, and the number of simulations is apparently not large enough to detect it.

3.4.2. Three-way table

In our simulation, the three lists have response probabilities 0.8 for list A ( $π_{A}$ ), 0.7 for list B ( $π_{B}$ ) and 0.9 for list C ( $π_{C}$ ). The joint probabilities ( $π_{111}, π_{110}, π_{101}, π_{100}$ , $π_{011}, π_{010}, π_{001}, π_{000}$ ) are, respectively, (0.504, 0.056, 0.216, 0.024, 0.126, 0.054, 0.014, 0.006). Cell counts are obtained by $μ_{ijk} = (N π_{ijk})$ .

The results of the simulation are shown in Table . The model used to generate the population data is specified by the odds ratios in the column OR, and the model used to analyze the data in the column Model. Thus there are correctly specified models, such as OR = (1,1,1) and Model is [A][B][C], overparameterised models, such as OR = (1,1,1) and Model is [AB][AC][BC], and misspecified models, such as OR = (1.5, 2, 1) and Model is [A][BC]. When the model is specified correctly the relative bias is close to zero. The variance of the estimates, decreases as the population size increases, compare, for example, OR = (1,1,1) and Model [A][B][C] with population sizes 1000 and 2000. When the model is overparameterised, the variance is larger than when the model is correctly specified.

4. Discussion

We proposed a simple method to generate contingency tables with fixed marginal probabilities and dependence structures described by loglinear models. We discuss the generation of two-way and three-way contingency tables and this can be extended to in a straightforward way.

We specify the loglinear model in terms of Poisson regression with a log link. Using an offset, prespecified interaction parameters can be fitted that can be understood in terms of odds ratios. For two-way tables, odds ratios describe dependence between two variables, and conditional odds ratios describe partial dependence between three or more variables in higher-way tables.

We think that this method is simpler than previously proposed methods which are discussed in the introduction, and it can be extended to other models that can be presented as loglinear models, such as logistic regression models with categorical predictors and some latent variable models. The structure of our approach to generating data is the same as the standard approach for undertaking multiple system estimation, so it is straightforward to understand how the dependence is introduced. Our approach is less tailored to the generalized estimating equations which have been the main application of other approaches.

In this paper we apply this method to population size estimation for two-way and three-way contingency tables. The results show the impacts of dependence between lists and model (mis)specification.

Supplemental material

Supplemental Material

Download PDF (43.5 KB)

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

Agresti A. Categorical data analysis. New Jersey: Wiley; 2013.
Google Scholar
Haberman SJ. A stabilized Newton–Raphson algorithm for log-linear models for frequency tables derived by indirect observation. Sociol Methodol. 1988;18:193–211. doi: 10.2307/271049
Google Scholar
Hessen DJ. Loglinear representations of multivariate Bernoulli Rasch models. Br J Math Stat Psychol. 2011;64:337–354. doi: 10.1348/2044-8317.002000
PubMed Web of Science ®Google Scholar
Nelder JA, Wedderburn RWM. Generalized linear models. J R Stat Soc Ser A. 1972;135(3):370–384. doi: 10.2307/2344614
Web of Science ®Google Scholar
Green PJ. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. J R Stat Soc Ser B. 1984;46:149–170. doi: 10.1111/j.2517-6161.1984.tb01288.x
Google Scholar
Bishop YM, Fienberg SE, Holland PW. Discrete multivariate analysis, theory and practice. New York: McGraw-Hill; 1975.
Google Scholar
Good IJ. Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. Ann Math Stat. 1963;34(3):911–934. doi: 10.1214/aoms/1177704014
Google Scholar
Lee AJ. Generating random binary deviates having fixed marginal distributions and specified degrees of association. Am Stat. 1993;47:209–215. doi: 10.1080/00031305.1993.10475980
Web of Science ®Google Scholar
Gange SJ. Generating multivariate categorical variates using the iterative proportional fitting algorithm. Am Stat. 1995;49(2):134–138. doi: 10.1080/00031305.1995.10476130
Web of Science ®Google Scholar
Choi HJ. A simple method for constructing multidimensional distributions of correlated categorical data. Commun Stat Simul Comput. 2008;37(7):1377–1384. doi: 10.1080/03610910802001863
Web of Science ®Google Scholar
Kateri M. Contingency table analysis. New York: Springer; 2014.
Google Scholar
Geenens G. Copula modeling for discrete random vectors. Depend Model. 2020;8(1):417–440. doi: 10.1515/demo-2020-0022
Web of Science ®Google Scholar
Emrich LJ, Piedmonte MR. A method for generating high-dimensional multivariate binary variates. Am Stat. 1991;45(4):302–304. doi: 10.1080/00031305.1991.10475828
Web of Science ®Google Scholar
Park CG, Park T, Shin DW. A simple method for generating correlated binary variates. Am Stat. 1996;50(4):306–310. doi: 10.1080/00031305.1996.10473557
Web of Science ®Google Scholar
Lee AJ. Some simple methods for generating correlated categorical variates. Comput Stat Data Anal. 1997;26(2):133–148. doi: 10.1016/S0167-9473(97)00030-3
Web of Science ®Google Scholar
Lunn AD, Davies SJ. A note on generating correlated binary variables. Biometrika. 1998;85(2):487–490. doi: 10.1093/biomet/85.2.487
Web of Science ®Google Scholar
Al Osh MA, Lee SJ. A simple approach for generating correlated binary variates. J Stat Comput Simul. 2001;70(3):231–255. doi: 10.1080/00949650108812119
Web of Science ®Google Scholar
Demirtas H. A method for multivariate ordinal data generation given marginal distributions and correlations. J Stat Comput Simul. 2006;76(11):1017–1025. doi: 10.1080/10629360600569246
Web of Science ®Google Scholar
Jiang W, Song S, Hou L, et al. A set of efficient methods to generate high-dimensional binary data with specified correlation structures. Am Stat. 2021;75(3):310–322. doi: 10.1080/00031305.2020.1816213
Web of Science ®Google Scholar
Amatya A, Demirtas H. Multiord: an R package for generating correlated ordinal data. Commun Stat Simul Comput. 2015;44(7):1683–1691. doi: 10.1080/03610918.2013.824097
Web of Science ®Google Scholar
Touloumis A. Simulating correlated binary and multinomial responses under marginal model specification: the SimCorMultRes package. R J. 2016;8(2):79–91. doi: 10.32614/RJ-2016-034
Google Scholar
Brown J, Abbott O, Diamond I. Dependence in the 2001 one-number census project. J R Stat Soc Ser A Stat Soc. 2006;169:883–902. doi: 10.1111/j.1467-985X.2006.00431.x
Web of Science ®Google Scholar
Baffour B, Brown JJ, Smith PWF. An investigation of triple system estimators in censuses. Stat J IAOS. 2013;29:53–68. doi: 10.3233/SJI-130760.
Google Scholar
Gerritse SC, Bakker BFM, van der Heijden PGM. Different methods to complete datasets used for capture–recapture estimation: estimating the number of usual residents in the Netherlands. Stat J IAOS. 2015;31:613–627. doi: 10.3233/SJI-150938
Google Scholar
Rasch G. Probabilistic models for some intelligence and attainment tests. The University of Chicago Press; Originally published 1960, Copenhagen: The Danish Institute of Educational Research, Chicago; 1980.
Google Scholar
Lazarsfeld PF, Henry NW. Latent structure analysis. Boston: Houghton Mifflin; 1968.
Google Scholar
Fienberg SE. The analysis of cross-classified categorical data. 2nd ed. Cambridge: Massachusetts Institute of Technology Press; 1980.
Google Scholar
Wolter KM. Some coverage error models for census data. J Am Stat Assoc. 1986;81(394):337–346. doi: 10.1080/01621459.1986.10478277
Web of Science ®Google Scholar
Fienberg SE. The multiple recapture census for closed populations and incomplete 2k contingency tables. Biometrika. 1972;59:409–439. doi: 10.1093/biomet/59.3.591.
Web of Science ®Google Scholar
International Working Group for Disease Monitoring and Forecasting. Capture–recapture and multiple record systems estimation. Part I. History and theoretical development. Am J Epidemiol. 1995;142:1059–1068. doi: 10.1093/oxfordjournals.aje.a117559
PubMed Web of Science ®Google Scholar
Seber GAF. The estimation of animal abundance and related parameters. Caldwell (NJ): Blackburn Press; 1982.
Google Scholar
Van der Heijden PGM, Whittaker J, Cruyff M, et al. People born in the Middle East but residing in the Netherlands: invariant population size estimates and the role of active and passive covariates. Ann Appl Stat. 2012;6:831–852. doi: 10.1214/12-AOAS536
Web of Science ®Google Scholar

Generating contingency tables with fixed marginal probabilities and dependence structures described by loglinear models

Abstract

1. Introduction

2. Methodology

2.1. Loglinear models

2.1.1. Models for two-way tables

2.1.2. Models for three-way tables

2.2. Latent variable models

2.3. Logistic regression

2.4. The proposed method

2.4.1. Model fitting for two-way tables

2.4.2. Generating two-way tables with given properties

2.4.3. Model fitting for three-way tables

2.4.4. Generating three-way tables with given properties

3. Application to population size estimation

3.1. Dual system estimation

3.2. Triple system estimation

3.3. How to apply the proposed methodology to population size estimation

3.4. Simulation study

3.4.1. Two-way table

3.4.2. Three-way table

4. Discussion

Supplemental Material

Disclosure statement

References

Information for

Open access

Opportunities

Help and information

Generating contingency tables with fixed marginal probabilities and dependence structures described by loglinear models

Abstract

1. Introduction

2. Methodology

2.1. Loglinear models

2.1.1. Models for two-way tables

2.1.2. Models for three-way tables

2.2. Latent variable models

2.3. Logistic regression

2.4. The proposed method

2.4.1. Model fitting for two-way tables

2.4.2. Generating two-way tables with given properties

2.4.3. Model fitting for three-way tables

2.4.4. Generating three-way tables with given properties

3. Application to population size estimation

3.1. Dual system estimation

3.2. Triple system estimation

3.3. How to apply the proposed methodology to population size estimation

3.4. Simulation study

Table 1. Simulation results for dual systems estimation with inclusion probabilities πA and πB, sizes of odds ratios (OR), the mean and median of the estimates, the 95% confidence interval of the mean, the relative bias and coefficient of variation (CV), where the fitted model is [A][B].

3.4.1. Two-way table

3.4.2. Three-way table

4. Discussion

Supplemental Material

Disclosure statement

References

Related research

To cite this article:

Download citation

Your download is now in progress and you may close this window

Login or register to access this feature

Information for

Open access

Opportunities

Help and information

Keep up to date