Classical Models for Twin Data: The Case of Categorical Data

ABSTRACT

The classical ACE and ADE models were used in the 1990s to estimate the heredity of a phenotype from data on monozygotic and dizygotic twins. The author extended these models to a model called ACDE with four parameters instead of only three. In that paper, the data were assumed to be continuous. This paper considers the same models in the case where the data are categorical. It is shown how these models can be estimated by maximum likelihood. An example is given based on twin data on BMI from the UK. These are the same data as in the previous paper but in categorized form.

Twin data models

Models for twin data or other types of family relationships were common in the 1980s and 1990s; see, e.g., Neale and Cardon (1992). One has data on monozygotic (MZ) twins and dizygotic (DZ) twins. It is assumed that MZ twins have all their genes in common and DZ twins have half of their genes in common. We observe a phenotype on each twin. A phenotype may be anything that one can measure or observe on each twin, such as symptoms (e.g., allergy), illnesses (e.g., cancer, diabetes, asthma), physical measures (e.g., weight, height), personality traits (e.g., nervousness), longevity (how long one lives), etc.

Jöreskog (2020) covered the case when the phenotypes were continuous and showed how the classical models ACE, ADE, and ACDE could be estimated. This paper covers the case when the phenotypes are categorical. This requires a different methodology.

Data

There are $T_{m z}$ monozygotic twin pairs and $T_{d z}$ dizygotic twin pairs. Each twin in a twin pair responds in one of $c$ ordered categories. Typically, $c = 2, 3, 4,$ or $5$. So the data are represented by two two-way contingency tables, one for the MZ group and one for the DZ group:

\begin{aligned} N_{m z} &= \begin{bmatrix} n_{11}^{(m z)} & n_{12}^{(m z)} & \dots & n_{1 c}^{(m z)} \\ n_{21}^{(m z)} & n_{22}^{(m z)} & \dots & n_{2 c}^{(m z)} \\ \vdots & \vdots & \ddots & \vdots \\ n_{c 1}^{(m z)} & n_{c 2}^{(m z)} & \dots & n_{c c}^{(m z)} \end{bmatrix}, \\ N_{d z} &= \begin{bmatrix} n_{11}^{(d z)} & n_{12}^{(d z)} & \dots & n_{1 c}^{(d z)} \\ n_{21}^{(d z)} & n_{22}^{(d z)} & \dots & n_{2 c}^{(d z)} \\ \vdots & \vdots & \ddots & \vdots \\ n_{c 1}^{(d z)} & n_{c 2}^{(d z)} & \dots & n_{c c}^{(d z)} \end{bmatrix} \end{aligned}

where $n_{i j}$ is the number of twin pairs in the sample where twin 1 responds in category $i$ and twin 2 responds in category $j$. We have

T_{m z} = \sum_{i = 1}^{c} \sum_{j = 1}^{c} n_{i j}^{(m z)}, T_{d z} = \sum_{i = 1}^{c} \sum_{j = 1}^{c} n_{i j}^{(d z)}

Model

Let $π_{i j}$ be the probability of a response in categories $i$ and $j$ as described in the previous section. The model is

\begin{aligned} π_{i j}^{(m z)} &= \int_{τ_{i - 1}}^{τ_{i}} \int_{τ_{j - 1}}^{τ_{j}} ϕ_{2} (u, v, ρ_{m z}) \, d u \, d v, \\ π_{i j}^{(d z)} &= \int_{τ_{i - 1}}^{τ_{i}} \int_{τ_{j - 1}}^{τ_{j}} ϕ_{2} (u, v, ρ_{d z}) \, d u \, d v, \end{aligned} \tag{1}

where

ϕ_{2} (u, v, ρ) = \frac{1}{2 π \sqrt{1 - ρ^{2}}} e^{- \frac{1}{2 (1 - ρ^{2})} (u^{2} - 2 ρ u v + v^{2})},

is the standard bivariate normal density with correlation $ρ$ . The thresholds

τ_{0} = - \infty < τ_{1} < τ_{2} < \dots < τ_{c - 1} < τ_{c} = \infty

are assumed to be the same within and between groups. If there are $c$ categories, there are $c - 1$ strictly increasing thresholds. This choice of thresholds ensures that the only difference between the MZ and DZ groups is manifested through the difference between $ρ_{m z}$ and $ρ_{d z}$.

Note that $π_{i j} = π_{j i}$ for both MZ and DZ. Since each marginal density of $ϕ_{2} (u, v, ρ)$ is the standard normal density, independent of $ρ$, it follows from (1) that the univariate row and column marginals

\sum_{j = 1}^{c} π_{i j} = \sum_{j = 1}^{c} π_{j i}, i = 1, 2, \dots, c,

are equal and the same for both MZ and DZ. Denoting this common marginal as

α = (α_{1}, α_{2}, \dots, α_{c}),

we have $α_{i} > 0$ and $\sum_{i = 1}^{c} α_{i} = 1$. Then

τ_{i} = Φ^{- 1} (α_{1} + α_{2} + \dots + α_{i}), i = 1, 2, \dots, c - 1.

Conversely,

α_{i} = Φ (τ_{i}) - Φ (τ_{i - 1}), i = 1, 2, \dots, c .

Here $Φ$ is the standard normal distribution function and $Φ^{- 1}$ is its inverse function. Note that the $α_{i}$, and hence the $τ_{i}$, do not depend on $ρ_{m z}$ or $ρ_{d z}$.
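The cell probabilities in (1) and the marginal property just described can be checked numerically. The following is a minimal sketch, not the author's implementation: it assumes the identity $Φ_{2} (h, k, ρ) = Φ (h) Φ (k) + \int_{0}^{ρ} ϕ_{2} (h, k, r) d r$ for the bivariate normal distribution function and evaluates the integral with Simpson's rule. The function names `Phi2` and `cell_probs`, the thresholds, and the two correlations are illustrative assumptions.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function
INF = float("inf")


def phi2(h, k, r):
    """Standard bivariate normal density at (h, k) with correlation r."""
    q = (h * h - 2.0 * r * h * k + k * k) / (2.0 * (1.0 - r * r))
    return exp(-q) / (2.0 * pi * sqrt(1.0 - r * r))


def Phi2(h, k, rho, steps=400):
    """P(U <= h, V <= k), using Simpson's rule on the integral over r in [0, rho]."""
    if abs(h) == INF or abs(k) == INF:
        return PHI(h) * PHI(k)  # the density term vanishes at infinite limits
    w = rho / steps  # steps must be even for Simpson's rule
    total = phi2(h, k, 0.0) + phi2(h, k, rho)
    for s in range(1, steps):
        total += (4 if s % 2 else 2) * phi2(h, k, s * w)
    return PHI(h) * PHI(k) + total * w / 3.0


def cell_probs(taus, rho):
    """c x c matrix of pi_ij for thresholds tau_1 < ... < tau_{c-1}."""
    t = [-INF] + list(taus) + [INF]
    c = len(t) - 1
    F = [[Phi2(t[i], t[j], rho) for j in range(c + 1)] for i in range(c + 1)]
    # Rectangle probability from the four corner values of the distribution function
    return [[F[i][j] - F[i - 1][j] - F[i][j - 1] + F[i - 1][j - 1]
             for j in range(1, c + 1)] for i in range(1, c + 1)]


taus = [0.0, 0.8416]  # hypothetical thresholds, roughly alpha = (0.5, 0.3, 0.2)
alpha = [PHI(b) - PHI(a) for a, b in zip([-INF] + taus, taus + [INF])]

P_low = cell_probs(taus, 0.3)   # hypothetical "DZ-like" correlation
P_high = cell_probs(taus, 0.8)  # hypothetical "MZ-like" correlation

for label, P in (("rho = 0.3", P_low), ("rho = 0.8", P_high)):
    print(label, "row margins:", [round(sum(row), 4) for row in P])
print("alpha from thresholds:", [round(a, 4) for a in alpha])
```

Both tables have the same row and column margins, equal to $Φ (τ_{i}) - Φ (τ_{i - 1})$, while the off-diagonal mass shrinks as the correlation grows.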

Here we use the same notation and definitions of $h^{2}$, $c^{2}$, $d^{2}$, and $e^{2}$ as in the previous paper (Jöreskog, 2020). Since there is no scale of the categorical measurement, it is assumed that the total variance $h^{2} + c^{2} + d^{2} + e^{2} = 1$. So the only equations we have are

ρ_{m z} = h^{2} + c^{2} + d^{2} \tag{2}

ρ_{d z} = \frac{1}{2} h^{2} + c^{2} + \frac{1}{4} d^{2} \tag{3}

These are two equations in the three unknowns $h^{2}, c^{2}, d^{2}$, so there is no unique solution. However, if $d^{2} = 0$, we can solve for $h^{2}$ and $c^{2}$. This gives

h^{2} = 2 (ρ_{m z} - ρ_{d z}), c^{2} = 2 ρ_{d z} - ρ_{m z}

corresponding to the ACE model. Similarly, if $c^{2} = 0$ , we can solve for $h^{2}$ and $d^{2}$ . This gives

h^{2} = 4 ρ_{d z} - ρ_{m z}, d^{2} = 2 ρ_{m z} - 4 ρ_{d z}

corresponding to the ADE model.

As argued in the previous paper, neither $c^{2} = 0$ nor $d^{2} = 0$ is a reasonable assumption. Therefore, an alternative model is considered as follows.

Equations (2) and (3) can be written in matrix form as

(\begin{matrix} ρ_{m z} \\ ρ_{d z} \end{matrix}) = (\begin{matrix} 1 & 1 & 1 \\ \frac{1}{2} & 1 & \frac{1}{4} \end{matrix}) (\begin{matrix} h^{2} \\ c^{2} \\ d^{2} \end{matrix}) \tag{4}

Let $A$ be the matrix in (4) and let $B$ be

B = A^{'} (A A^{'})^{- 1} .

Paper-and-pencil algebra gives

B = (\begin{matrix} \frac{1}{2} & - \frac{2}{7} \\ - \frac{1}{2} & \frac{10}{7} \\ 1 & - \frac{8}{7} \end{matrix}) .

The ACDE model is defined by

(\begin{matrix} h^{2} \\ c^{2} \\ d^{2} \end{matrix}) = B (\begin{matrix} ρ_{m z} \\ ρ_{d z} \end{matrix})

In particular,

h^{2} = \frac{1}{2} ρ_{m z} - \frac{2}{7} ρ_{d z}
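The matrix $B = A^{'} (A A^{'})^{- 1}$ can be checked with exact rational arithmetic. The sketch below uses only Python's `fractions` module; the helper names `matmul`, `transpose`, and `inv2` are illustrative. Since $A$ has full row rank, this $B$ is the Moore-Penrose pseudoinverse of $A$, so the ACDE solution is the minimum-norm solution of (4).

```python
from fractions import Fraction as Fr

# Coefficient matrix A of equation (4): rows are rho_mz and rho_dz
A = [[Fr(1), Fr(1), Fr(1)],
     [Fr(1, 2), Fr(1), Fr(1, 4)]]


def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


At = transpose(A)
B = matmul(At, inv2(matmul(A, At)))

for name, row in zip(("h2", "c2", "d2"), B):
    print(name, [str(x) for x in row])

# A B should be the 2 x 2 identity, so the implied rho_mz and rho_dz are reproduced
print(matmul(A, B))
```

The first row gives $h^{2} = \frac{1}{2} ρ_{m z} - \frac{2}{7} ρ_{d z}$, and the mixed fractions $1 \frac{3}{7}$ and $- 1 \frac{1}{7}$ correspond to $\frac{10}{7}$ and $- \frac{8}{7}$.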

Estimation

Assuming multinomial distributions of the $n_{i j}$ and independent groups, the log likelihood function is

ln L = \sum_{i = 1}^{c} \sum_{j = 1}^{c} n_{i j}^{(m z)} ln π_{i j}^{(m z)} + \sum_{i = 1}^{c} \sum_{j = 1}^{c} n_{i j}^{(d z)} ln π_{i j}^{(d z)} .

In the following the shorter notation $\sum_{i j}$ is used instead of $\sum_{i = 1}^{c} \sum_{j = 1}^{c}$ . Let

τ = (τ_{1}, τ_{2}, \dots, τ_{c - 1})

Assuming that the samples are sufficiently large so that all $n_{i j} > 0$ , it is convenient to minimize the fit function

F (θ) = F_{m z} (τ, ρ_{m z}) + F_{d z} (τ, ρ_{d z})

where

F_{m z} (τ, ρ_{m z}) = \sum_{i j} n_{i j}^{(m z)} ln p_{i j}^{(m z)} - \sum_{i j} n_{i j}^{(m z)} ln π_{i j}^{(m z)}

and

F_{d z} (τ, ρ_{d z}) = \sum_{i j} n_{i j}^{(d z)} ln p_{i j}^{(d z)} - \sum_{i j} n_{i j}^{(d z)} ln π_{i j}^{(d z)}

and where

p_{i j}^{(m z)} = n_{i j}^{(m z)} / T_{m z}, p_{i j}^{(d z)} = n_{i j}^{(d z)} / T_{d z} .

The parameter vector $θ$ is

θ = (τ_{1}, τ_{2}, \dots, τ_{c - 1}, ρ_{m z}, ρ_{d z})

The function $F (θ)$, which is non-negative, is to be minimized with respect to $θ$. This can be done numerically using first and second derivatives. At the minimum, the inverse of the expected Hessian matrix can be used as an estimate of the asymptotic covariance matrix of $\hat{θ}$, and

C = 2 F (\hat{θ})

can be used as a chi-square statistic with $2 (c^{2} - 1) - (c + 1)$ degrees of freedom to test the fit of the model. This is an LR (likelihood ratio) statistic. Alternatively, one can use the GF (goodness-of-fit) statistic

G = \sum_{i j} \frac{{(n_{i j}^{(m z)} - T_{m z} {\hat{π}}_{i j}^{(m z)})}^{2}}{T_{m z} {\hat{π}}_{i j}^{(m z)}} + \sum_{i j} \frac{{(n_{i j}^{(d z)} - T_{d z} {\hat{π}}_{i j}^{(d z)})}^{2}}{T_{d z} {\hat{π}}_{i j}^{(d z)}},

which is also approximately distributed as $χ^{2}$ with the same degrees of freedom.
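The two test statistics are straightforward to compute once the fitted cell probabilities are available. The sketch below takes the observed tables and the model probabilities as plain nested lists; the function names and the small 2 x 2 toy tables are hypothetical and serve only to exercise the formulas. As in the text, all $n_{i j} > 0$ is assumed.

```python
from math import log


def lr_statistic(tables, probs):
    """C = 2 F: twice the sum over groups and cells of n * ln(p / pi), with p = n / T."""
    C = 0.0
    for N, P in zip(tables, probs):
        T = sum(map(sum, N))
        for n_row, p_row in zip(N, P):
            for n, p in zip(n_row, p_row):
                C += 2.0 * n * log((n / T) / p)  # assumes n > 0
    return C


def gf_statistic(tables, probs):
    """G: sum over groups and cells of (n - T pi)^2 / (T pi)."""
    G = 0.0
    for N, P in zip(tables, probs):
        T = sum(map(sum, N))
        for n_row, p_row in zip(N, P):
            for n, p in zip(n_row, p_row):
                G += (n - T * p) ** 2 / (T * p)
    return G


def degrees_of_freedom(c):
    """2 (c^2 - 1) free cell proportions minus (c - 1) thresholds and 2 correlations."""
    return 2 * (c * c - 1) - (c + 1)


# Hypothetical toy data: two groups with the same 2 x 2 table, fitted by a uniform model
toy = [[30, 20], [20, 30]]
uniform = [[0.25, 0.25], [0.25, 0.25]]
C_toy = lr_statistic([toy, toy], [uniform, uniform])
G_toy = gf_statistic([toy, toy], [uniform, uniform])
print(round(C_toy, 3), round(G_toy, 3), degrees_of_freedom(3))
```

When the model probabilities equal the observed proportions, both statistics are zero, which is the sense in which $F (θ)$ measures lack of fit.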

Example

In this example we use the same BMI data from the UK as in the previous paper, but the measurement from each twin is classified into three categories: normal, overweight, and obese on the basis of their BMI. So $c = 3$ , $T_{m z} = 794$ and $T_{d z} = 758$ . The contingency tables are

N_{m z} = (\begin{matrix} 325 & 78 & 8 \\ 60 & 150 & 38 \\ 7 & 40 & 88 \end{matrix}), \quad N_{d z} = (\begin{matrix} 202 & 96 & 46 \\ 97 & 112 & 56 \\ 19 & 64 & 66 \end{matrix}) .

In terms of proportions these are

\begin{aligned} P_{m z} = (\begin{matrix} 0.4093 & 0.0982 & 0.0101 \\ 0.0756 & 0.1889 & 0.0479 \\ 0.0088 & 0.0504 & 0.1108 \end{matrix}) \\ P_{d z} = (\begin{matrix} 0.2665 & 0.1266 & 0.0607 \\ 0.1280 & 0.1478 & 0.0739 \\ 0.0251 & 0.0844 & 0.0871 \end{matrix}) . \end{aligned}

Because of the inequality $τ_{1} < τ_{2}$ and the restricted range of $ρ_{m z}$ and $ρ_{d z}$, it is necessary to have good starting values for the numerical minimization of $F (θ)$. These are obtained as follows.

Take the average of the row and column margins of $P_{m z}$ and $P_{d z}$. Then form a weighted average of these averages with weights $w_{m z} = T_{m z} / (T_{m z} + T_{d z})$ and $w_{d z} = T_{d z} / (T_{m z} + T_{d z})$. This gives $α$ as

α = (0.472, 0.339, 0.189)

and estimated $τ_{1} = - 0.070$ and $τ_{2} = 0.882$ .
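The starting values for the thresholds can be reproduced directly from the two contingency tables. The sketch below follows the steps just described (average row and column margins within each group, then weight by group size) and applies $τ_{i} = Φ^{- 1} (α_{1} + \dots + α_{i})$ using `statistics.NormalDist` from the Python standard library. The helper name `averaged_margin` is illustrative.

```python
from statistics import NormalDist

N_mz = [[325, 78, 8], [60, 150, 38], [7, 40, 88]]
N_dz = [[202, 96, 46], [97, 112, 56], [19, 64, 66]]


def averaged_margin(N):
    """Average of row and column marginal proportions, plus the table total."""
    T = sum(map(sum, N))
    rows = [sum(r) / T for r in N]
    cols = [sum(col) / T for col in zip(*N)]
    return [(r + c) / 2 for r, c in zip(rows, cols)], T


a_mz, T_mz = averaged_margin(N_mz)
a_dz, T_dz = averaged_margin(N_dz)

# Weighted average over the two groups, weights proportional to group size
w_mz = T_mz / (T_mz + T_dz)
w_dz = T_dz / (T_mz + T_dz)
alpha = [w_mz * m + w_dz * d for m, d in zip(a_mz, a_dz)]

# Thresholds from cumulative marginal proportions
taus, cum = [], 0.0
for a in alpha[:-1]:
    cum += a
    taus.append(NormalDist().inv_cdf(cum))

print("T_mz, T_dz:", T_mz, T_dz)
print("alpha:", [round(a, 3) for a in alpha])
print("tau:", [round(t, 3) for t in taus])
```

The printed values should agree with the $α$, $τ_{1}$, and $τ_{2}$ given above to three decimals.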

To obtain starting values of $ρ_{m z}$ and $ρ_{d z}$, minimize $F (θ)$ with respect to $ρ_{m z}$ and $ρ_{d z}$ for given $τ_{1}$ and $τ_{2}$, separately for MZ and DZ. This is a special case of the polychoric correlation, see Olsson (1979), which gives $ρ_{m z} = 0.810$ and $ρ_{d z} = 0.456$. So the starting values for $θ$ are

θ_{0} = (- 0.070, 0.882, 0.810, 0.456)

The maximum likelihood estimates are

\hat{θ} = (- 0.068, 0.872, 0.811, 0.457),

very close to the starting values. The value of the LR statistic $C$ is 31.50 with 12 degrees of freedom. This has a $P$-value of 0.0017, indicating that the model does not fit well. The value of the GF statistic $G$ is 31.81, very similar to $C$.

To examine the fit in detail, one can consider the GF residuals

\frac{n_{i j} - T {\hat{π}}_{i j}}{\sqrt{T {\hat{π}}_{i j}}}

separately for MZ and DZ. These are

(\begin{matrix} 1.81 & 0.49 & 1.34 \\ - 1.60 & 0.24 & - 1.49 \\ 0.95 & - 1.20 & - 1.09 \end{matrix})

for MZ and

(\begin{matrix} - 1.47 & - 0.48 & 2.70 \\ - 0.38 & 1.15 & 0.04 \\ - 2.15 & 1.11 & 0.97 \end{matrix})

for DZ. Two residuals exceed 2 in absolute value, both in the DZ group: 2.70 in cell (1, 3) and $- 2.15$ in cell (3, 1). The larger one suggests that there are too many twin pairs in the DZ group where twin 1 is in category 1 (normal weight) and twin 2 is in category 3 (obese). There are 46 such pairs. According to the model the expected number is 31.
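The expected count for this DZ cell can be checked from the reported estimates. The sketch below is an independent numerical route, not the author's computation: it writes the cell probability as $P (U ≤ τ_{1}, V > τ_{2}) = \int_{- \infty}^{τ_{1}} ϕ (u) [1 - Φ ((τ_{2} - ρ u) / \sqrt{1 - ρ^{2}})] d u$ and integrates with Simpson's rule. Truncating the lower limit at $- 8$ and the function name `prob_low_high` are assumptions of the sketch, and the inputs are the rounded estimates, so small differences from the reported residual are to be expected.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal


def prob_low_high(t_low, t_high, rho, lower=-8.0, steps=2000):
    """P(U <= t_low, V > t_high) for a standard bivariate normal with correlation rho."""
    s = sqrt(1.0 - rho * rho)

    def f(u):
        # Density of U times the conditional probability that V exceeds t_high given U = u
        return Z.pdf(u) * (1.0 - Z.cdf((t_high - rho * u) / s))

    w = (t_low - lower) / steps  # steps must be even for Simpson's rule
    total = f(lower) + f(t_low)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(lower + k * w)
    return total * w / 3.0


tau1, tau2, rho_dz = -0.068, 0.872, 0.457  # rounded ML estimates from the text
T_dz, n13 = 758, 46                        # DZ total and observed count in cell (1, 3)

expected = T_dz * prob_low_high(tau1, tau2, rho_dz)
residual = (n13 - expected) / sqrt(expected)
print("expected count:", round(expected, 1))
print("GF residual:", round(residual, 2))
```

Because the model is symmetric in the two twins, the same expected count applies to cell (3, 1), where only 19 pairs were observed.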

Since the total sample size $T = 1552$ is very large compared to the number of parameters, it is reasonable to consider RMSEA (Root Mean Square Error of Approximation) as a measure of fit, see Browne and Cudeck (1993). With $d = 12$ degrees of freedom, this is $R = \sqrt{(C - d) / (d T)} = 0.03$, indicating that the model fits at least approximately.
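The reported $P$-value and RMSEA can be reproduced from $C = 31.50$, 12 degrees of freedom, and $T = 1552$. The sketch below assumes the Browne and Cudeck form of RMSEA with the degrees of freedom in the denominator, $\sqrt{(C - d) / (d T)}$, which is the form that yields 0.03 for these numbers. For the $P$-value it uses the closed-form upper tail of the chi-square distribution that holds for even degrees of freedom, so no statistics library is needed. The function names are illustrative.

```python
from math import exp, factorial, sqrt


def chi2_sf_even_df(x, df):
    """Upper tail P(X > x) for chi-square with even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    if df % 2:
        raise ValueError("this closed form requires an even number of degrees of freedom")
    half = x / 2.0
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))


def rmsea(C, df, T):
    """Root mean square error of approximation, truncated at zero."""
    return sqrt(max(C - df, 0.0) / (df * T))


C, df, T = 31.50, 12, 1552  # values reported in the text
p_value = chi2_sf_even_df(C, df)
R = rmsea(C, df, T)
print("P-value:", round(p_value, 4))
print("RMSEA:", round(R, 3))
```

Omitting the factor $d$ in the denominator would give a value above 0.1 rather than 0.03.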

The estimated asymptotic covariance matrix is

T \cdot \mathrm{ACov} (\hat{θ}) = (\begin{matrix} 1.2104 & & & \\ 0.7182 & 1.4200 & & \\ 0.0719 & - 0.1734 & 0.5459 & \\ - 0.1259 & - 0.4722 & 0.0789 & 2.4473 \end{matrix})

Dividing these numbers by 1552 and taking the square root of the diagonal elements one can obtain the standard errors of the parameter estimates $\hat{θ}$ .

Let

B_{A C E} = (\begin{matrix} 2 & - 2 \\ - 1 & 2 \\ 0 & 0 \end{matrix}), \quad B_{A D E} = (\begin{matrix} - 1 & 4 \\ 0 & 0 \\ 2 & - 4 \end{matrix}), \quad B_{A C D E} = (\begin{matrix} \frac{1}{2} & - \frac{2}{7} \\ - \frac{1}{2} & \frac{10}{7} \\ 1 & - \frac{8}{7} \end{matrix})

Estimates of $h^{2}, c^{2}, d^{2}$ are obtained from

(\begin{matrix} {\hat{h}}^{2} \\ {\hat{c}}^{2} \\ {\hat{d}}^{2} \end{matrix}) = B (\begin{matrix} {\hat{ρ}}_{m z} \\ {\hat{ρ}}_{d z} \end{matrix})

and estimates of their asymptotic covariance matrix are obtained from

T \cdot \mathrm{ACov} (\begin{matrix} {\hat{h}}^{2} \\ {\hat{c}}^{2} \\ {\hat{d}}^{2} \end{matrix}) = B (\begin{matrix} 0.5459 & \\ 0.0789 & 1.5644 \end{matrix}) B^{'}

The results are summarized in Table 1.

Table 1. Parameter estimates (0* indicates a fixed value) and standard errors

The value of ${\hat{h}}^{2}$ in Table 1 compares well with the values of $\hat{H}$ in Table 2 of Jöreskog (2020). The ADE model gives non-admissible results in both papers. The values of ${\hat{h}}^{2}$ and $\hat{H}$ for the ACE and ACDE models are fairly close. The standard errors are smaller for $\hat{H}$. Intuitively, this is expected since the continuous data carry more information than the categorical data. As in the previous paper, the ACDE model gives a more reasonable value of heredity, 28%, than the 71% given by the ACE model.
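The point estimates quoted above follow from applying the three $B$ matrices to the estimated correlations. The sketch below uses the rounded estimates ${\hat{ρ}}_{m z} = 0.811$ and ${\hat{ρ}}_{d z} = 0.457$; the dictionary layout is illustrative, and $e^{2}$ is obtained from the constraint $h^{2} + c^{2} + d^{2} + e^{2} = 1$. Standard errors are not computed here.

```python
from fractions import Fraction as Fr

# Rows are the coefficients of (rho_mz, rho_dz) for h2, c2, d2 in each model
B = {
    "ACE": [[2, -2], [-1, 2], [0, 0]],
    "ADE": [[-1, 4], [0, 0], [2, -4]],
    "ACDE": [[Fr(1, 2), Fr(-2, 7)], [Fr(-1, 2), Fr(10, 7)], [Fr(1), Fr(-8, 7)]],
}

rho_mz, rho_dz = 0.811, 0.457  # rounded ML estimates from the text

estimates = {}
for model, rows in B.items():
    h2, c2, d2 = (float(a) * rho_mz + float(b) * rho_dz for a, b in rows)
    e2 = 1.0 - (h2 + c2 + d2)  # total variance is fixed at 1
    estimates[model] = {"h2": h2, "c2": c2, "d2": d2, "e2": e2}
    print(model, {k: round(v, 3) for k, v in estimates[model].items()})
```

In all three models $h^{2} + c^{2} + d^{2}$ equals ${\hat{ρ}}_{m z}$, so the models share the same $e^{2}$ and differ only in how the remainder is split. The ADE row shows why that model is non-admissible here: $h^{2}$ exceeds 1 and $d^{2}$ is negative.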

References

Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Sage Publications.
Jöreskog, K. G. (2020, June). Classical models for twin data. Structural Equation Modeling: A Multidisciplinary Journal, 28, 121–126. Published Open Access. https://doi.org/10.1080/10705511.2020.1789465
Neale, M. C., & Cardon, L. R. (1992). Methodology for genetic studies of twins and families. Kluwer Academic Publishers.
Olsson, U. (1979). Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 44, 443–460. https://doi.org/10.1007/BF02296207
