1,315
Views
0
CrossRef citations to date
0
Altmetric
Research Article

Classical Models for Twin Data: The Case of Categorical Data

ABSTRACT

The classical models ACE and ADE were used in the 1990’s to estimate heredity of a phenotype from data on monozygotic and dizygotic twins. The author extended these models to a model called ACDE with four parameters instead of only three. In that paper, the data were assumed to be continuous. This paper considers the same models in the case where the data is categorical. It is showed how these models can be estimated by maximum likelihood. An example is given based on twin data on BMI from the UK. This is the same data as in the previous paper but in categorized form.

Twin data models

Models for twin data or other types of family relationships were common in the 1980’s and 1990’s see e.g., Neale and Cardon (Citation1992). One has data on monozygotic (MZ) twins and dizygotic (DZ) twins. It is assumed that MZ twins have all their genes in common and dz twins have half of their genes in common. We observe a phenotype on each twin. A phenotype may be anything that one can measure or observe on each twin, such as symptoms (e.g., allergy), illnesses (e.g.,cancer, diabetes, asthma), physical measures (e.g., weight, height), personality traits (e.g.,nervousness), longevity (how long you live) etc.

Jöreskog (Citation2020) covered the case when the phenotypes were continuous and showed how the classical models ACE, ADE, and ACDE could be estimated. This paper covers the case when the phenotypes are categorical. This requires a different methodology.

Data

Thera are Tmz monozygotic twins and Tdz dizygotic twins. Each twin in a twin pair responds on a set of c ordered categories. Typically, c=2,3,4,5. So the data is represented by a two-way contingency table, one for the MZ group and one for the DZ group:

Nmz=n11(mz)n12(mz)n1c(mz)n21(mz)n22(mz)n2c(mz)nc1(mz)nc2(mz)ncc(mz),Ndz=n11(dz)n12(dz)n1c(dz)n21(dz)n22(dz)n2c(dz)nc1(dz)nc2(dz)ncc(dz)

where nij is the number of twins in the sample where twin 1 responds on category i and twin 2 responds on category j. We have

Tmz=i=1cj=1cnij(mz),Tdz=i=1cj=1cnij(dz)

Model

Let πij be the probability of a response in categories i and j as described in the previous section. The model is

(1) πij(mz)=τi1τiτj1τjϕ2(u,v,ρmz)dudvπij(dz)=τi1τiτj1τjϕ2(u,v,ρdz)dudv,(1)

where

ϕ2(u,v,ρ)=12π1ρ2e12(1ρ2)(u22ρuv+v2),

is the standard bivariate normal density with correlation ρ. The thresholds

τ0=<τ1<τ2<<τc1<τc=

are assumed to be the same within and between groups. If there are c categories, there are c1 strictly inceasing thresholds. This choice of thresholds ensures that the only difference between the MZ and DZ group is manifested through the difference between ρmz and ρdz.

Note that πij=πji, for both MZ and DZ. Since the marginal density functions of ϕ2(u,v;ρ) is the standard normal density function independent of ρ, it follows from (1) that the univariate row and column marginals

j=1cπij=j=1cπji,i=1,2,,c,

are equal and the same for both MZ and DZ. Denoting this common marginal as

α=(α1,α2,,αc),

we have αi>0 and i=1cαi=1 Then

τi=Φ1(α1+α2++αi),i=1,2,,c1.

Reversely,

αi=Φ(τi)Φ(τi1),i=1,2,,c.

Φ is the standard normal distribution function and Φ1 is its inverse function. Note that the αi, and hence, the τi do not depend on ρmz or ρdz.

Here we use the same notation and definitions of h2, c2, d2, and e2 as in the previous paper (Jöreskog, Citation2020). Since there is no scale of the categorical measurement, it is assumed that the total variance h2+c2+d2+e2=1. So the only equations we have are

(2) ρmz=h2+c2+d2(2)
(3) ρdz=12h2+c2+14d2(3)

This is two equations in the three unknowns h2,c2,d2. There is no unique solution for h2,c2,d2. However, if d2=0, we can solve for h2 and c2. This gives

h2=2(ρmzρdz),c2=2ρdzρmz

corresponding to the ACE model. Similarly, if c2=0, we can solve for h2 and d2. This gives

h2=4ρdzρmz,d2=2ρmz4ρdz

corresponding to the ADE model.

As argued in the previous paper, neither c2=0 nor d2=0 are reasonable assumptions. Therefore, an alternative model is considered as follows

EquationEquations (2) and (Equation3) can be written in matrix form as

(4) ρmzρdz=11112114h2c2d2(4)

Let A be the matrix in (4) and let B be

B=A(AA)1.

Using paper and pencil algebra, gives

B=1227121371117.

The ACDE model is defined by

h2c2d2=Bρmzρdz

In particalar,

h2=ρmz27ρdz

Estimation

Assuming multinomial distributions of the nij and independent groups, the log likelihood function is

lnL=i=1cj=1cnij(mz)lnπij(mz)+i=1cj=1cnij(dz)lnπij(dz).

In the following the shorter notation ij is used instead of i=1cj=1c. Let

τ=(τ1,τ2,,τc1)

Assuming that the samples are sufficiently large so that all nij>0, it is convenient to minimize the fit function

F(θ)=Fmz(τ,ρmz)+Fdz(τ,ρdz)

where

Fmz(τ,ρmz)=ijnij(mz)lnpij(mz)ijnij(mz)lnπij(mz)

and

Fdz(τ,ρdz)=ijnij(dz)lnpij(dz)ijnij(dz)lnπij(dz)

and where

pij(mz)=nij(mz)/Tmz,pij(dz)=nij(dz)/Tdz.

The parameter vector θ is

θ=(τ1,τ2,,τc1,ρmz,ρdz)

The function F(θ) which is non-negative, is to be minimized with respect to θ. F(θ) can be minimized numerically using first and second derivatives. At the minimum, the inverse of the expected Hessian matrix can be used as an estimate of the asymptotic covariance matrix of θˆ, and

C=2F(θˆ)

can be used as a chi-square statistic with 2(c21)(c+1) degrees of freedom to test the fit of the model. This is an LR (likelihood ratio) statistic. Alternatively, one can use the GF (goodness-of-fit) statistic

G=ij(nij(mz)Tmzπˆij(mz))2Tmzπˆij(mz)+ij(nij(dz)Tdzπˆij(dz))2Tdzπˆij(dz),

which is also approximately distributed as χ2 with the same degrees of freedom.

Example

In this example we use the same BMI data from the UK as in the previous paper, but the measurement from each twin is classified into three categories: normal, overweight, and obese on the basis of their BMI. So c=3, Tmz=794 and Tdz=758. The contingency tables are

Nmz=325788601503874088Ndz=20296469711256196466.

In terms of proportions these are

Pmz=0.40930.09820.01010.07560.18890.04790.00880.05040.1108Pdz=0.26650.12660.06070.12800.14780.07390.02510.08440.0871.

Because of the inequality τ1<τ2 and the restricted range of ρmz and ρdz it is neccessary to have good starting values for the numerical minimization of F(θ). These are obtained as follows.

Take the average of the row and column margins of Pmz and Pdz. Then form a wheighted average of these averages with weights wmz=Tmz/(Tmz+Tdz) and wdz=Tdz/(Tmz+Tdz). This gives α as

α=(0.472,0.339,0.189)

and estimated τ1=0.070 and τ2=0.882.

To obtain starting values of ρmz and ρdz minimize F(θ) with respect to ρmz and ρdz for given τ1 and τ2 separately for MZ and DZ. This is a special case of the polychoric correlation, see Olsson (Citation1979), which gives ρmz=0.810 and ρdz=0.456. So the starting values for θ are

θ0=(0.070,0.882,0.810,0.456)

The maximum likelihood estimates are

θˆ=(0.068,0.872,0.811,0.457),

very close to the starting values. The value of the LR statistic C is 31.50 with 12 degrees of freedom. This has a P-value of 0.0017 indicating a not so good fit of the model. The value of the GF statistic G is 31.81, very similar to C.

To examine the fit in detail, one can consider the GF residuals

nijTπˆijTπˆij

separately for MZ and DZ. These are

  1.81  0.49  1.341.60  0.241.49  0.951.201.09

for MZ and

1.470.482.700.38  1.150.042.15  1.110.97

for DZ. It is seen that there is only one residual larger than 2 in absolute value. This suggests there are too many twins in the DZ group where twin 1 is in category 1 (normal weight) and twin 2 is in category 3 (obese). There are 46 such twins. According to the model the expected number is 31.

Since the total sample sise T=1552 is very large compared to the number of parameters, it is reasonable to consider RMSEA (Root Mean Square Error of Approximation) as a measure of fit, see Browne and Cudeck (Citation1993). This is R=(Cd)/T=0.03 indicating that the model fits at least approximately.

The estimated asymptotic covariance matrix is

TAcov(θˆ)=  1.2104  0.7182  1.4200  0.07190.17340.54590.12590.47220.07892.4473

Dividing these numbers by 1552 and taking the square root of the diagonal elements one can obtain the standard errors of the parameter estimates θˆ.

Let

BACE=221200BADE=140024BACDE=12271211371117

Estimates of h2,c2,d2 are obtained from

hˆ2cˆ2dˆ2=Bρˆmzρˆdz

and estimates of their asymptotic covariance matrix are obtained from

TAcovhˆ2cˆ2dˆ2=B0.54590.07891.5644B

The results are summarized in .

Table 1. Parameter estimates (0* indicates a fixed value) and standard errors

The value of hˆ2 in compares well the values of Hˆ in Table 2 of Jöreskog (Citation2020). The ADE model gives non-admissible results in both papers. The values of hˆ2 and Hˆ for the ACE and ACDE models are fairly close. The standard errors are smaller for Hˆ. Intuitively, this is expected since the continuous data has more information than the categorical data. As in the previous paper, the ACDE model gives a more reasonable value of heredity of 28% than 71% as given by the ACE model.

References