ABSTRACT
The classical models ACE and ADE were used in the 1990s to estimate heredity of a phenotype from data on monozygotic and dizygotic twins. In a previous paper (Jöreskog, 2020), the author extended these models to a model called ACDE with four parameters instead of only three. In that paper, the data were assumed to be continuous. This paper considers the same models in the case where the data are categorical. It is shown how these models can be estimated by maximum likelihood. An example is given based on twin data on BMI from the UK. These are the same data as in the previous paper but in categorized form.
Twin data models
Models for twin data or other types of family relationships were common in the 1980s and 1990s; see, e.g., Neale and Cardon (1992). One has data on monozygotic (MZ) twins and dizygotic (DZ) twins. It is assumed that MZ twins have all their genes in common and DZ twins have, on average, half of their genes in common. We observe a phenotype on each twin. A phenotype may be anything that one can measure or observe on each twin, such as symptoms (e.g., allergy), illnesses (e.g., cancer, diabetes, asthma), physical measures (e.g., weight, height), personality traits (e.g., nervousness), longevity (how long one lives), etc.
Jöreskog (2020) covered the case when the phenotypes were continuous and showed how the classical models ACE, ADE, and ACDE could be estimated. This paper covers the case when the phenotypes are categorical. This requires a different methodology.
Data
There are $N_{MZ}$ monozygotic twin pairs and
$N_{DZ}$ dizygotic twin pairs. Each twin in a twin pair responds on a set of $k$
ordered categories. Typically,
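. So the data are represented by a two-way $k \times k$ contingency table, one for the MZ group and one for the DZ group:

$$
\begin{array}{c|cccc}
 & 1 & 2 & \cdots & k \\ \hline
1 & n_{11}^{(g)} & n_{12}^{(g)} & \cdots & n_{1k}^{(g)} \\
2 & n_{21}^{(g)} & n_{22}^{(g)} & \cdots & n_{2k}^{(g)} \\
\vdots & \vdots & \vdots & & \vdots \\
k & n_{k1}^{(g)} & n_{k2}^{(g)} & \cdots & n_{kk}^{(g)}
\end{array}
\qquad g = \mathrm{MZ}, \mathrm{DZ},
$$

where $n_{ij}^{(g)}$ is the number of twin pairs in the sample of group $g$ where twin 1 responds in category $i$
and twin 2 responds in category $j$. We have

$$\sum_{i=1}^{k} \sum_{j=1}^{k} n_{ij}^{(g)} = N_g, \qquad g = \mathrm{MZ}, \mathrm{DZ}.$$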
Model
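Let $\pi_{ij}^{(g)}$ be the probability of a response in categories $i$ (twin 1) and $j$ (twin 2) in group $g$, as described in the previous section. The model is

$$\pi_{ij}^{(g)} = \int_{\tau_{i-1}}^{\tau_i} \int_{\tau_{j-1}}^{\tau_j} \phi_2(u, v; \rho_g)\, dv\, du, \qquad i, j = 1, 2, \ldots, k, \tag{1}$$

where $\phi_2(u, v; \rho)$ is the standard bivariate normal density with correlation $\rho$. The thresholds $-\infty = \tau_0 < \tau_1 < \cdots < \tau_{k-1} < \tau_k = +\infty$ are assumed to be the same within and between groups. If there are $k$ categories, there are $k - 1$ strictly increasing thresholds. This choice of thresholds ensures that the only difference between the MZ and DZ groups is manifested through the difference between $\rho_{MZ}$ and $\rho_{DZ}$.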
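The cell probabilities of such a threshold model can be evaluated numerically. The following is a minimal sketch, not the author's implementation: it assumes the model is the usual bivariate normal threshold model, reduces the double integral to a single integral by conditioning on the first variable, and uses a composite Simpson rule. The function names and the truncation of the infinite range at ±8 are choices made for this illustration only.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    """Standard normal density function."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cell_probabilities(taus, rho, steps=400, bound=8.0):
    """k x k table of P(tau_{i-1} < X <= tau_i, tau_{j-1} < Y <= tau_j)."""
    s = math.sqrt(1.0 - rho * rho)
    x_cuts = [-bound] + list(taus) + [bound]        # +-bound stands in for +-infinity
    y_cuts = [-math.inf] + list(taus) + [math.inf]
    k = len(taus) + 1
    table = []
    for i in range(k):
        a = x_cuts[i]
        h = (x_cuts[i + 1] - a) / steps             # steps must be even
        row = []
        for j in range(k):
            total = 0.0
            for m in range(steps + 1):
                x = a + m * h
                weight = 1 if m in (0, steps) else (4 if m % 2 else 2)
                # P(tau_{j-1} < Y <= tau_j | X = x), since Y | X = x is N(rho*x, 1 - rho^2)
                cond = (norm_cdf((y_cuts[j + 1] - rho * x) / s)
                        - norm_cdf((y_cuts[j] - rho * x) / s))
                total += weight * norm_pdf(x) * cond
            row.append(total * h / 3.0)
        table.append(row)
    return table

# Hypothetical thresholds and correlation, not estimates from the paper
pi = cell_probabilities([0.0, 0.8], 0.5)
```

With a first threshold at 0 and a correlation of 0.5, the first cell is the orthant probability 1/4 + arcsin(0.5)/(2π) = 1/3, which gives a simple check of the integration.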
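Note that $\pi_{ij} = \pi_{ji}$ for both MZ and DZ. Since each marginal density function of $\phi_2(u, v; \rho)$ is the standard normal density function, independent of $\rho$, it follows from (1) that the univariate row and column marginals $\pi_{i\cdot}$ and $\pi_{\cdot i}$ are equal and the same for both MZ and DZ. Denoting this common marginal as $\pi_i$, we have $\pi_1 = \Phi(\tau_1)$ and

$$\pi_i = \Phi(\tau_i) - \Phi(\tau_{i-1}), \qquad i = 2, \ldots, k, \qquad \text{with } \Phi(\tau_k) = 1.$$

Then

$$\Phi(\tau_i) = \pi_1 + \pi_2 + \cdots + \pi_i.$$

Conversely,

$$\tau_i = \Phi^{-1}(\pi_1 + \pi_2 + \cdots + \pi_i), \qquad i = 1, 2, \ldots, k - 1,$$

where $\Phi$ is the standard normal distribution function and $\Phi^{-1}$ is its inverse function. Note that the $\pi_i$, and hence the $\tau_i$, do not depend on $\rho_{MZ}$ or $\rho_{DZ}$.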
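The relation between the common marginal probabilities and the thresholds can be illustrated as follows. This is a small sketch using the inverse standard normal distribution function from the Python standard library; the marginal proportions are hypothetical and are not the paper's data.

```python
from itertools import accumulate
from statistics import NormalDist

def thresholds_from_marginals(marginals):
    """Return the k-1 thresholds: inverse normal of the cumulative marginals."""
    cumulative = list(accumulate(marginals))[:-1]   # drop the last value, which is 1
    std_normal = NormalDist()                       # mean 0, standard deviation 1
    return [std_normal.inv_cdf(c) for c in cumulative]

# Hypothetical marginal proportions for k = 3 ordered categories
taus = thresholds_from_marginals([0.5, 0.3, 0.2])
```

Because the cumulative proportions are increasing, the resulting thresholds are automatically strictly increasing.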
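Here we use the same notation and definitions of $\sigma_a^2$, $\sigma_c^2$, $\sigma_d^2$, and $\sigma_e^2$ as in the previous paper (Jöreskog, 2020). Since there is no scale of the categorical measurement, it is assumed that the total variance $\sigma_a^2 + \sigma_c^2 + \sigma_d^2 + \sigma_e^2 = 1$. So the only equations we have are

$$\rho_{MZ} = \sigma_a^2 + \sigma_c^2 + \sigma_d^2, \tag{2}$$

$$\rho_{DZ} = \tfrac{1}{2}\sigma_a^2 + \sigma_c^2 + \tfrac{1}{4}\sigma_d^2. \tag{3}$$

These are two equations in the three unknowns $\sigma_a^2$, $\sigma_c^2$, and $\sigma_d^2$. There is no unique solution for these unknowns. However, if $\sigma_d^2 = 0$, we can solve for $\sigma_a^2$ and $\sigma_c^2$. This gives

$$\sigma_a^2 = 2(\rho_{MZ} - \rho_{DZ}), \qquad \sigma_c^2 = 2\rho_{DZ} - \rho_{MZ},$$

corresponding to the ACE model. Similarly, if $\sigma_c^2 = 0$, we can solve for $\sigma_a^2$ and $\sigma_d^2$. This gives

$$\sigma_a^2 = 4\rho_{DZ} - \rho_{MZ}, \qquad \sigma_d^2 = 2(\rho_{MZ} - 2\rho_{DZ}),$$

corresponding to the ADE model.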
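As a worked illustration, the sketch below solves the two restricted systems. It assumes the usual twin-model equations with unit total variance, namely ρ_MZ = σ_a² + σ_c² + σ_d² and ρ_DZ = σ_a²/2 + σ_c² + σ_d²/4, so that σ_e² = 1 − ρ_MZ. The function names and the input correlations are hypothetical.

```python
def ace(rho_mz, rho_dz):
    """Solve for (a2, c2, e2) when the dominance variance d2 is fixed at 0."""
    a2 = 2.0 * (rho_mz - rho_dz)
    c2 = 2.0 * rho_dz - rho_mz
    return a2, c2, 1.0 - rho_mz

def ade(rho_mz, rho_dz):
    """Solve for (a2, d2, e2) when the common-environment variance c2 is fixed at 0."""
    a2 = 4.0 * rho_dz - rho_mz
    d2 = 2.0 * rho_mz - 4.0 * rho_dz
    return a2, d2, 1.0 - rho_mz

# Hypothetical correlations, not estimates from the paper
ace_solution = ace(0.8, 0.5)   # approximately (0.6, 0.2, 0.2)
ade_solution = ade(0.8, 0.5)   # approximately (1.2, -0.4, 0.2)
```

With these inputs the ADE solution contains a negative variance, which shows how a restricted model can give non-admissible values whenever ρ_MZ is smaller than 2ρ_DZ.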
As argued in the previous paper, neither $\sigma_d^2 = 0$ nor
$\sigma_c^2 = 0$ is a reasonable assumption. Therefore, an alternative model is considered as follows.
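Equations (2) and (3) can be written in matrix form as

$$\begin{pmatrix} \rho_{MZ} \\ \rho_{DZ} \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 \\ \tfrac{1}{2} & 1 & \tfrac{1}{4} \end{pmatrix} \begin{pmatrix} \sigma_a^2 \\ \sigma_c^2 \\ \sigma_d^2 \end{pmatrix}. \tag{4}$$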
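Let $A$ be the $2 \times 3$ matrix in (4) and let $A^{+}$ be the generalized inverse $A'(AA')^{-1}$.
Using paper and pencil algebra gives

$$A^{+} = \begin{pmatrix} \tfrac{1}{2} & -\tfrac{2}{7} \\ -\tfrac{1}{2} & \tfrac{10}{7} \\ 1 & -\tfrac{8}{7} \end{pmatrix}.$$

The ACDE model is defined by

$$\begin{pmatrix} \sigma_a^2 \\ \sigma_c^2 \\ \sigma_d^2 \end{pmatrix} = A^{+} \begin{pmatrix} \rho_{MZ} \\ \rho_{DZ} \end{pmatrix}.$$

In particular,

$$\sigma_a^2 = \tfrac{1}{2}\rho_{MZ} - \tfrac{2}{7}\rho_{DZ}, \qquad \sigma_c^2 = -\tfrac{1}{2}\rho_{MZ} + \tfrac{10}{7}\rho_{DZ}, \qquad \sigma_d^2 = \rho_{MZ} - \tfrac{8}{7}\rho_{DZ}.$$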
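The pencil-and-paper algebra can be checked with exact rational arithmetic. The sketch below rests on an assumption: that the solution intended for the underdetermined system is the minimum-norm one, obtained from the Moore-Penrose inverse A'(AA')⁻¹ of the coefficient matrix with rows (1, 1, 1) and (1/2, 1, 1/4). The function names and input correlations are hypothetical.

```python
from fractions import Fraction as Fr

# Coefficient matrix of the two correlation equations (assumed form)
A = [[Fr(1), Fr(1), Fr(1)],
     [Fr(1, 2), Fr(1), Fr(1, 4)]]

def min_norm_inverse(a):
    """Return A'(AA')^{-1} for a 2 x 3 matrix of full row rank."""
    g11 = sum(x * x for x in a[0])
    g12 = sum(x * y for x, y in zip(a[0], a[1]))
    g22 = sum(y * y for y in a[1])
    det = g11 * g22 - g12 * g12
    inv = [[g22 / det, -g12 / det],
           [-g12 / det, g11 / det]]                 # inverse of the 2 x 2 matrix AA'
    return [[a[0][j] * inv[0][c] + a[1][j] * inv[1][c] for c in range(2)]
            for j in range(3)]

A_plus = min_norm_inverse(A)

def acde(rho_mz, rho_dz):
    """Minimum-norm (a2, c2, d2) plus e2 = 1 - rho_mz."""
    a2, c2, d2 = (row[0] * rho_mz + row[1] * rho_dz for row in A_plus)
    return a2, c2, d2, 1 - rho_mz

# Hypothetical correlations, not estimates from the paper
solution = acde(Fr(4, 5), Fr(1, 2))
```

Among all solutions of the two equations, this one has the smallest sum of squares of the three variance components, which is what distinguishes it from the ACE and ADE solutions that fix one component at zero.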
Estimation
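Assuming multinomial distributions of the $n_{ij}^{(g)}$ and independent groups, the log likelihood function is, apart from a constant,

$$\ln L = \sum_{g} \sum_{i=1}^{k} \sum_{j=1}^{k} n_{ij}^{(g)} \ln \pi_{ij}^{(g)}(\theta), \qquad g = \mathrm{MZ}, \mathrm{DZ}.$$

In the following the shorter notation $\pi_{ij}^{(g)}$ is used instead of $\pi_{ij}^{(g)}(\theta)$. Let $p_{ij}^{(g)} = n_{ij}^{(g)} / N_g$ be the sample proportions. Assuming that the samples are sufficiently large so that all $p_{ij}^{(g)} > 0$, it is convenient to minimize the fit function

$$F(\theta) = \frac{N_{MZ}}{N} F_{MZ}(\theta) + \frac{N_{DZ}}{N} F_{DZ}(\theta),$$

where

$$F_{MZ}(\theta) = \sum_{i=1}^{k} \sum_{j=1}^{k} p_{ij}^{(MZ)} \ln \frac{p_{ij}^{(MZ)}}{\pi_{ij}^{(MZ)}}$$

and

$$F_{DZ}(\theta) = \sum_{i=1}^{k} \sum_{j=1}^{k} p_{ij}^{(DZ)} \ln \frac{p_{ij}^{(DZ)}}{\pi_{ij}^{(DZ)}},$$

and where $N = N_{MZ} + N_{DZ}$. The parameter vector is

$$\theta = (\tau_1, \ldots, \tau_{k-1}, \rho_{MZ}, \rho_{DZ})'.$$

The function $F(\theta)$, which is non-negative, is to be minimized with respect to $\theta$.

$F(\theta)$ can be minimized numerically using first and second derivatives. At the minimum, the inverse of the expected Hessian matrix can be used as an estimate of the asymptotic covariance matrix of $\hat{\theta}$, and $\chi^2_{LR} = 2N F(\hat{\theta})$ can be used as a chi-square statistic with $2(k^2 - 1) - (k + 1)$ degrees of freedom to test the fit of the model. This is an LR (likelihood ratio) statistic. Alternatively, one can use the GF (goodness-of-fit) statistic

$$\chi^2_{GF} = \sum_{g} N_g \sum_{i=1}^{k} \sum_{j=1}^{k} \frac{\left(p_{ij}^{(g)} - \hat{\pi}_{ij}^{(g)}\right)^2}{\hat{\pi}_{ij}^{(g)}},$$

which is also approximately distributed as $\chi^2$ with the same degrees of freedom.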
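The two fit statistics can be computed from observed counts and fitted probabilities as sketched below. The sketch assumes the usual multinomial forms: the LR statistic is 2 Σ n ln(n / (Nπ)) and the GF statistic is the Pearson sum Σ (n − Nπ)² / (Nπ), each summed over the cells of one group and then added over the MZ and DZ groups. The degrees-of-freedom count assumes k − 1 thresholds and two correlations as free parameters. All names and numbers are hypothetical.

```python
import math

def fit_statistics(counts, probs):
    """LR and GF chi-square contributions of one group (k x k lists)."""
    n_total = sum(sum(row) for row in counts)
    lr = 0.0
    gf = 0.0
    for n_row, pi_row in zip(counts, probs):
        for n, pi in zip(n_row, pi_row):
            expected = n_total * pi
            if n > 0:                               # a zero count contributes 0 to LR
                lr += 2.0 * n * math.log(n / expected)
            gf += (n - expected) ** 2 / expected
    return lr, gf

def degrees_of_freedom(k):
    """2(k^2 - 1) free cell proportions minus (k - 1) thresholds and 2 correlations."""
    return 2 * (k * k - 1) - (k + 1)

# Hypothetical 2 x 2 table compared with equal cell probabilities
lr, gf = fit_statistics([[40, 10], [10, 40]], [[0.25, 0.25], [0.25, 0.25]])
```

For three categories the count gives 12 degrees of freedom.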
Example
In this example we use the same BMI data from the UK as in the previous paper, but the measurement from each twin is classified into three categories: normal, overweight, and obese on the basis of their BMI. So $k = 3$,
and
. The contingency tables are
In terms of proportions these are
Because of the inequality $\tau_1 < \tau_2$ and the restricted range of $\rho_{MZ}$
and $\rho_{DZ}$,
it is necessary to have good starting values for the numerical minimization of
$F(\theta)$. These are obtained as follows.
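Take the average of the row and column margins of the MZ table of proportions and of the DZ table of proportions.
Then form a weighted average of these averages with weights
$N_{MZ}/N$ and $N_{DZ}/N$, where $N = N_{MZ} + N_{DZ}$.
This gives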
as
and estimated and
.
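The averaging steps can be sketched as follows, assuming the weights are the group shares of the total number of twin pairs. The two tables are hypothetical and are not the BMI data; the function name is illustrative.

```python
from statistics import NormalDist

def pooled_marginals(table_mz, table_dz):
    """Weighted average of the averaged row and column margins of two tables."""
    def averaged_margin(table):
        n = sum(sum(row) for row in table)
        k = len(table)
        rows = [sum(table[i]) / n for i in range(k)]
        cols = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
        return [(r + c) / 2.0 for r, c in zip(rows, cols)], n

    margin_mz, n_mz = averaged_margin(table_mz)
    margin_dz, n_dz = averaged_margin(table_dz)
    n = n_mz + n_dz
    return [(n_mz * a + n_dz * b) / n for a, b in zip(margin_mz, margin_dz)]

# Hypothetical 3 x 3 tables of counts (rows: twin 1, columns: twin 2)
mz = [[30, 10, 0], [10, 20, 10], [0, 10, 10]]
dz = [[60, 30, 10], [30, 40, 10], [10, 10, 0]]
marginals = pooled_marginals(mz, dz)

# Starting thresholds: inverse normal of the cumulative pooled marginals
cumulative = [marginals[0], marginals[0] + marginals[1]]
start_taus = [NormalDist().inv_cdf(c) for c in cumulative]
```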
To obtain starting values of $\rho_{MZ}$ and
$\rho_{DZ}$,
minimize
$F(\theta)$
with respect to
$\rho_{MZ}$ and $\rho_{DZ}$ for given $\tau_1$ and $\tau_2$,
separately for MZ and DZ. This is a special case of the polychoric correlation, see Olsson (1979), which gives
and
. So the starting values for
$\theta$ are
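A one-dimensional search of this kind can be sketched as below. This is an illustration under stated assumptions rather than the procedure actually used in the paper: cell probabilities come from a bivariate normal threshold model evaluated by Simpson integration, the group fit function is Σ p ln(p/π), and the minimum over the correlation is located by golden-section search, which assumes the function has a single minimum on the search interval. All function names are hypothetical.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cell_probabilities(taus, rho, steps=200, bound=8.0):
    """k x k cell probabilities of a bivariate normal threshold model."""
    s = math.sqrt(1.0 - rho * rho)
    x_cuts = [-bound] + list(taus) + [bound]        # +-bound stands in for +-infinity
    y_cuts = [-math.inf] + list(taus) + [math.inf]
    k = len(taus) + 1
    table = []
    for i in range(k):
        a = x_cuts[i]
        h = (x_cuts[i + 1] - a) / steps
        row = []
        for j in range(k):
            total = 0.0
            for m in range(steps + 1):              # composite Simpson rule
                x = a + m * h
                weight = 1 if m in (0, steps) else (4 if m % 2 else 2)
                cond = (norm_cdf((y_cuts[j + 1] - rho * x) / s)
                        - norm_cdf((y_cuts[j] - rho * x) / s))
                total += weight * norm_pdf(x) * cond
            row.append(total * h / 3.0)
        table.append(row)
    return table

def group_fit(props, taus, rho):
    """Sum of p * ln(p / pi) over the cells of one group."""
    pi = cell_probabilities(taus, rho)
    return sum(p * math.log(p / max(q, 1e-12))      # guard against tiny probabilities
               for p_row, q_row in zip(props, pi)
               for p, q in zip(p_row, q_row) if p > 0)

def starting_rho(props, taus, lo=-0.95, hi=0.95, tol=1e-6):
    """Golden-section search for the correlation with thresholds held fixed."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    fc, fd = group_fit(props, taus, c), group_fit(props, taus, d)
    while hi - lo > tol:
        if fc < fd:                                 # minimum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - g * (hi - lo)
            fc = group_fit(props, taus, c)
        else:                                       # minimum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + g * (hi - lo)
            fd = group_fit(props, taus, d)
    return (lo + hi) / 2.0

# Check on exact proportions generated with a known correlation of 0.6
exact = cell_probabilities([0.0, 0.8], 0.6)
recovered = starting_rho(exact, [0.0, 0.8])
```

The search is run once for the MZ table and once for the DZ table, each time with the same fixed thresholds.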
The maximum likelihood estimates are
very close to the starting values. The value of the LR statistic is 31.50 with 12 degrees of freedom. This has a
$p$-value of 0.0017, indicating that the model does not fit well. The value of the GF statistic
is 31.81, very similar to
the value of the LR statistic.
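The reported p-value can be reproduced without a statistics library, because the upper tail of a chi-square distribution with an even number of degrees of freedom has a closed form: exp(−x/2) times the sum of (x/2)^j / j! for j from 0 to df/2 − 1. The function name below is hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable; valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j                            # (x/2)^j / j!
        total += term
    return math.exp(-half) * total

p_value = chi2_sf_even_df(31.50, 12)
```

Rounded to four decimals, this agrees with the value 0.0017 quoted in the text.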
To examine the fit in detail, one can consider the GF residuals
separately for MZ and DZ. These are
for MZ and
for DZ. It is seen that there is only one residual larger than 2 in absolute value. This suggests there are too many twin pairs in the DZ group where twin 1 is in category 1 (normal weight) and twin 2 is in category 3 (obese). There are 46 such pairs. According to the model the expected number is 31.
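The size of that residual can be approximated from the two numbers quoted in the text. The sketch assumes the GF residual is the Pearson residual, (observed − expected) / √expected, whose squares add up to the GF statistic. Since the expected count of 31 is a rounded figure, the result is only approximate.

```python
import math

def pearson_residual(observed, expected):
    """Pearson residual of one cell: (observed - expected) / sqrt(expected)."""
    return (observed - expected) / math.sqrt(expected)

# Observed and (rounded) expected counts quoted in the text for the DZ cell (1, 3)
residual = pearson_residual(46, 31)
```

The result is about 2.7, in line with the statement that this residual exceeds 2.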
Since the total sample size is very large compared to the number of parameters, it is reasonable to consider RMSEA (Root Mean Square Error of Approximation) as a measure of fit, see Browne and Cudeck (1993). This is
indicating that the model fits at least approximately.
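For orientation, one common single-sample form of the RMSEA is √(max(χ² − df, 0) / (df · N)). The sketch below applies it to the LR statistic quoted above and assumes that 1552, the divisor used later for the standard errors, is the total sample size N. Other variants exist (for example with N − 1, or with a multiple-group correction), so the value computed here need not coincide exactly with the one in the paper.

```python
import math

def rmsea(chi_square, df, n):
    """RMSEA in the form sqrt(max(chi2 - df, 0) / (df * n))."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * n))

# LR statistic and degrees of freedom from the text; n = 1552 is an assumption
value = rmsea(31.50, 12, 1552)
```

Under these assumptions the value is about 0.03, below the 0.05 that Browne and Cudeck (1993) suggest as indicating close fit.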
The estimated asymptotic covariance matrix is
Dividing these numbers by 1552 and taking the square root of the diagonal elements, one obtains the standard errors of the parameter estimates.
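The arithmetic of this step is sketched below. The matrix entries are hypothetical placeholders, not the estimated asymptotic covariance matrix from the paper; only the divisor 1552 is taken from the text.

```python
import math

def standard_errors(acov, n):
    """Square roots of the diagonal of acov / n."""
    return [math.sqrt(acov[i][i] / n) for i in range(len(acov))]

# Hypothetical 2 x 2 asymptotic covariance matrix
example_acov = [[1.552, 0.100],
                [0.100, 6.208]]
ses = standard_errors(example_acov, 1552)
```

Only the diagonal elements enter the standard errors; the off-diagonal elements matter when the covariance matrix of functions of the estimates is needed.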
Let
Estimates of are obtained from
and estimates of their asymptotic covariance matrix are obtained from
The results are summarized in Table 1.
Table 1. Parameter estimates (0* indicates a fixed value) and standard errors
The value of in Table 1 compares well with the values of
in Table 2 of Jöreskog (2020). The ADE model gives non-admissible results in both papers. The values of
and
for the ACE and ACDE models are fairly close. The standard errors are smaller for
. Intuitively, this is expected since continuous data carry more information than categorical data. As in the previous paper, the ACDE model gives a more reasonable value of heredity, 28%, than the 71% given by the ACE model.
References
- Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Sage Publications.
- Jöreskog, K. G. (2020, June). Classical models for twin data. Structural Equation Modeling: A Multidisciplinary Journal, 28, 121–126. Published Open Access. https://doi.org/10.1080/10705511.2020.1789465
- Neale, M. C., & Cardon, L. R. (1992). Methodology for genetic studies of twins and families. Kluwer Academic Publishers.
- Olsson, U. (1979). Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 44, 443–460. https://doi.org/10.1007/BF02296207