![MathJax Logo](/templates/jsp/_style2/_tandf/pb2/images/math-jax.gif)
Abstract
The estimation of parameters is a key component in statistical modelling and inference. However, parametrization of certain likelihood functions could lead to highly correlated estimates, causing numerical problems, mathematical complexities and difficulty in estimation or erroneous interpretation and subsequently inference. In statistical estimation, the concept of orthogonalization is familiar as a simplifying technique that allows parameters to be estimated independently and thus separates information from each other. We introduce a Fisher scoring iterative process that incorporates the Gram–Schmidt orthogonalization technique for maximum likelihood estimation. A finite mixture model for correlated binary data is used to illustrate the implementation of the method with discussion of application to oesophageal cancer data.
Public Interest Statement
In statistical estimation, the concept of orthogonalization is simply a technique that allows parameters to be estimated independently, through the rotation of the parameter space. The paper introduces an iterative algorithm that utilizes the Fisher scoring process and incorporates the Gram–Schmidt orthogonalization technique for maximum likelihood estimation. The algorithm that we propose provides an exceedingly convenient method for multi-parameter statistical problems first to reduce numerical difficulties and second to improve accuracy in parameter estimation.
1. Introduction
The problem of estimating parameters is one of the key stages in fitting a statistical model to a set of data. Typically, the model will contain some parameters which are not of interest in themselves, but whose values affect the inferences that we make about the parameters of direct interest. Parametrization used when working with certain distributions could lead to high correlations of the maximum likelihood estimates. Maximization algorithms converge rapidly if the initial estimates are good, the likelihood function is well approximated by a quadratic in the neighbourhood of the parameter space and the information matrix is well conditioned, which means that the parameter estimates are not strongly intercorrelated. High intercorrelation of parameters causes numerical problems, and difficulty or erroneous interpretation in the parameter estimation. In statistical estimation, the concept of orthogonalization is familiar as a simplifying technique that allows parameters to be estimated independently, through rotation of the parameter space. Computationally, all that is required is that the original parameters should be computable from the new ones, and vice-versa. An injective transformation will therefore be desirable, though not necessary.
Ross (ross) gave a comprehensive discussion of techniques that can be used to reduce correlation in particular problems. Among them are sequential and nested maximizations. Kalbliesch and Sprott (kalbliesch) discuss methods aimed at eliminating nuisance parameters from the likelihood function so that inference can be made about only the parameters of interest. Amari (amari, amari1) derived from a theoretical point of view, a general criteria for the existence and construction of orthogonal parameters for statistical models using differential geometry. Cox and Reid (cox) gave a more general procedure and covered the advantages in maximum likelihood estimation. The approach, though popular in theoretical statistics, is computationally impractical in many application situations. Willmot (willmot) discussed orthogonal parametrization but only for a two-parameter family of discrete distributions. His method, however, does not allow for higher parametric problems. Hurlimann (hurlimann) discusses the existence of orthogonal parameterization to the mean, characterized by a partial differential equation involving the mean, the variance and cummulant generating functions. Godambe (godambe) deals with the problem of nuisance parameters, within a semi-parametric set-up which includes the class of distributions associated with generalized linear models. Their technique uses the optimum orthogonal estimating functions of Godambe and Thompson (godambe1). Bonney and Kissling (bonney) applied the Gram–Schmidt orthogonalization (Clayton, clayton) to multi-normal variates and presented an application in genetics. Kwagyan, Apprey and Bonney (kwagyan1) presented the idea of Gram–Schmidt parameter orthogonalization in genetic analysis. A major consequence of parameter orthogonalization is that the maximum likelihood estimate of one parameter is not affected or changes slowly with the misspecification of the other. The most direct interpreted consequence is that the maximum likelihood estimates of orthogonal coordinates are asymptotically independent.
The remainder of this paper is organized as follows. In Section 2, the disposition model for clustered binary data proposed by Bonney (bonney1, bonney2) and further investigated by Kwagyan (kwagyan) is introduced and used as a motivation. In Section 3, an overview of maximum likelihood estimation and parameter orthogonalization is then presented. Section 4 develops the proposed Gram–Schmidt parameter orthogonalization and iterative scheme for maximum likelihood estimation. In section Section , fitting algorithm of the disposition model is discussed with application to oesophageal cancer data in Chinese families. Finally, Section wraps up with discussion of the methodology and outlines directions for further research.
2. Model for correlated binary data
Consider a sample of N groups, each of size and let
denote the vector of binary outcomes for the ith group. It is postulated that the measures of the outcomes within a group are possibly correlated. Further suppose the jth subject in the ith group has
individual-specific covariates
and let the ith group have group-specific covariates
. Let
denote the conditional probability of
given
; that is
We call the individual disposition which is simply interpreted as the probability of the outcome on one unit given another unit from the same cluster has the attribute. This in essence is an indication of aggregation. Assume further that within the group, a pair of observations satisfy the relation
where is common for all pairs. Clearly,
implies independence of the observations. Thus,
is a measure of the departure from independence and is called the relative disposition. With the above definitions, Bonney (bonney1, bonney2), and from further investigation through a latent mixture formulation by Kwagyan (kwagyan), has shown that the joint distribution for the N groups can be based on
(1)
(1)
The and
are generally modelled as
where is function of mean effects,
is a measure of within-group dependence and
a function describing the effect of the individual covariates. Typically,
and
are modelled, respectively, as
where
and
are unknown parameters to be estimated. When data consist of a few large clusters or truncated or size-biased samples, Bonney (bonney3) has shown that there is little or no information to separate the effects of M(Z) and D(Z). The estimates of the parameters tend to be highly correlated. Similar problems occur in other approaches as well. The popular estimating equations approach, GEE (Liang & Zeger, liang), suffers the same problems when there are only a few large clusters (see discussion of Prentice, prentice).
3. Maximum likelihood estimation & parameter orthogonalization
There are different statistical methods for estimating parameters, but the approach most commonly used is that based on maximum likelihood estimation. Maximum likelihood estimates of parameters are those values of the parameters that make the likelihood function a maximum. Let be observed data from a population with a probability distribution
indexed by the vector of unknown parameters
. The likelihood function for
is the joint distribution
According to the likelihood principle, is regarded as the value of
which is most strongly supported by the observed data. In practice,
is obtained by direct maximization of Log
or by solving the set of equations from the score function,
where
for which the Hessian matrix is negative definite. The matrix
is called Fisher (expected) Information matrix for
and its inverse,
gives the asymptotic variances and covariances of the maximum likelihood estimates.
Suppose is a p parameter vector that is partitioned into two vectors
and
of lengths
and
, respectively,
. Cox and Reid (cox) define
to be orthogonal to
if the elements of the information matrix satisfy
If this holds for all in the parameter space, it is sometimes called global orthogonality. If it holds at only one parameter value say
, then
and
are said to be locally orthogonal at
.
We now discuss Cox and Reid (cox) approach for the construction of orthogonal parameters. Suppose that initially the likelihood is specified in terms of ,
and let
be the log-likelihood function. Assume
is the parameter of interest and
the set of nuisance parameters. We seek a transformation from
to a new set of orthogonal parameters
,
. It is easiest to think of the original parameters
as some function of the new parameters,
, that is
Then, the log-likelihood function can be expressed as
where now refers to the log-likelihood in the new parametrization. Taking derivatives of this equation with respect to the new parameters, we have by the chain rule;
and(2)
(2)
The derivatives of are not functions of the data, and hence are constant with respect to expectation. Therefore, after taken expectations with respect to the distribution of the data indexed by the parameters
,the last term in Equation 3.1 is zero. If the parameters are orthogonal, then again from Equation 3.1,
so that the orthogonality equations are
We require the transformation from to
to have a non-zero Jacobian; hence,
This is a system of p differential equations which must be solved for . In fact, since
does not enter explicitly into the equations, the solution for
can contain an arbitrary function of
as the integrating constant. It is noted in the discussion of Cox and Reid that although it is sometimes theoretically possible to find a differential equation, simple explicit solutions of the differential equation were not feasible for the some models. There are also cases where explicit solutions are possible, but the original nuisance parameters could not be explicitly expressed in terms of the orthogonal parameter (Hills, hills). In general, global orthogonalization can also not be achieved by this approach.
4. Gram–Schmidt parameter orthogonalization
The Gram–Schmidt orthogonalization process (Clayton, clayton) is equivalent to the linear transformation to
defined as:
(3)
(3)
In linear regression set-up, is
adjusted for
and
are the multiple regression coefficients. Writing this using matrix notation, we have
where , the transformation matrix, is lower triangular with ones along the diagonal. The transformation matrix is chosen so that
are mutually uncorrelated. The Jacobian of the transformation is unity. Suppose
is the covariance matrix of
, then the covariance matrix of
is a diagonal matrix. We recommend that the parameters be ordered in terms of interest. This ensures that the parameters of most interest are least affected by round-off errors.
4.1. Evaluation of the transformation matrix
To evaluate the transformation matrix, , let
where
are elements of the covariance matrix. The orthogonality relation,
implies that for
,
Thus, the system of linear equations to be solved for entries of the transformation matrix based on the covariance matrix is:(4)
(4)
Consequently, solving the system of linear equations (4.2), we obtain the elements of the transformation matrix as(5)
(5)
where are entries of the information matrix.The observed information matrix would be a good approximation of the expected information, if there is difficulty evaluating it.
For illustration, when , we have Gram–Schmidt orthogonalization process in matrix notation as,
The orthogonalization relationship of the parameters implies
And so the system of linear equations to be solved for the elements of the transformation matrix is:
Let Q be the matrix of coefficients for the system of equations (4.4) Above; then, we note that Q is a patterned block diagonal matrix. Furthermore, let be the block diagonal of Q; then, in this illustration,
where
Thus, is the covariance matrix of the first r parameters. It can easily be shown that
is symmetric and positive definite and so Q is non-singular. A unique solution thus exists for the system of linear Equation (4.3) and in general for Equation (4.2). And so, the elements of the transformation matrix, B, can be obtained as
or
where are the entries of the information matrix .
Having obtained the original parameters,
can be obtained recursively as
or writing this using matrix notation, we have
4.2. Block orthogonalization
Let be a set of parameters to be estimated where
Suppose further that the set is correlated. Then, we wish to find a new set
through a linear transformation such that the vector of parameters in
is mutually uncorrelated. We allow for correlation, if any, within each set of parameters in
.
Further suppose the vector is the set of parameters of interest. Then, the Gram–Schmidt orthogonalization process computes the new set of parameters in terms of the original parameters as follows:
where and
are matrices with dimensions
,
and
, respectively.
In matrix notation, we write
or
where is a lower triangular block matrix whose diagonal unit matrix is the transformation matrix chosen such that
are mutually uncorrelated and where
is a block diagonal matrix. The Jacobian of the transformation is unity. The only way in which this procedure could break down would be if one of the vectors of
is identically zero. From the method of formation of
, it is clear that it is a linear combination of the vectors in
. If
is identically zero, this means that
are linearly dependent, thus contradicting the assumption of independence. Extensions to more than three sets of vectors of parameters readily follow.
4.3. Gram–Schmidt–Fisher scoring algorithm
We introduce a modification of the Fisher scoring algorithm incorporating the Gram–Schmidt process to obtain an information matrix which is diagonal and thus ensuring the approximate (near) orthogonality and consequently the stability of the estimates of the new parameters. Since the transformation is linear and injective, the original parameters are readily obtained.
Let be the log-likelihood in the original Parametrization; then, Fisher scoring algorithm is given by the iterative routine
(6)
(6)
where is the expected information matrix and
is the score function.
Suppose is transformed to
through the Gram–Schmidt orthogonalization process. Let
be the matrix of transformation, defined as in Equation (3.4). Then, since
is square and non-singular, we can write from Equation (4.5).
(7)
(7)
where
We shall call Equation (4.6) the Gram–Schmidt–Fisher scoring algorithm.
4.3.1. Algorithmic process
The proposed Gram–Schmit–Fisher scoring algorithm is a 2-stage iterative process that oscillates between Equations (4.3) and (4.6) until convergence and is described iteratively as follows:
1 | = | Start with an initial estimate of the original parameter, |
2 | = | Estimate |
3 | = | Determine the initial estimate of the orthogonal parameterization, |
4 | = | Update |
5 | = | Update |
6 | = | Repeat steps 2 through 5 using |
7 | = | Stop when |
4.4. Example
We consider a sample problem which concerns inference about the difference between two exponential means. Let and
be the independent exponential random variables with means
and
, respectively. Then, the joint distribution is given by the function
The score vector is
The information matrix is
and its inverse, the variance–covariance matrix, is
4.4.1. Cox and Reid approach
Orthogonal parametrization, following Cox–Reid method, requires solving the differential equation
This can be solved by separation of variables, leading to where
is an arbitrary function of
Cox–Reid suggest setting
as a suitable choice. Clearly, this does not produce a unique solution for
, regardless of the parametrization of
Thus, different reparametrizations may lead to different modified likelihoods, so that a Cox–Reid estimator may not exist or there may be many of them.
4.4.2. Gram–Schmidt approach
The proposed Gram–Schmidt approach requires seeking a transformation to
such that
and solving the equation
From the variance–covariance matrix, we obtain
Just like the Cox–Ried approach, it is impractical to obtain or express uniquely in terms of
and
and subsequently formulate a modified approximate orthogonal joint distribution function. Therefore, one could proceed iteratively, using the proposed Gram–Schmidt–Fisher scoring algorithm.
5. Fitting the disposition model
We will present procedures for estimating the parameters in the correlated model Equation (2.1). The calculations are standard, and so we will only outline the results. We write the likelihood of the joint distribution of the ith group as
where
Let
and define the following
The contribution of the i-th group to the log-likelihood is the term
and the contribution to the score function is the term
where
Setting , we obtain the score equations. Closed-form solutions are not possible and so the equations are solved by iterative procedures to obtain the maximum likelihood estimates of the parameters.
The contribution of the i-th group to the Hessian matrix is the term
Estimates of the parameter vector can be obtained by the Newton–Raphson iteration routine, which is given by the updating the formula
where is the estimate of the sth iteration.
The contribution of the i-th group to the Fisher information matrix is the term
where
and where and
are evaluated at
. The estimates can be alternatively obtained by the Fisher scoring method which is given by
The asymptotic variance–covariance matrix of the parameter estimates, , is the inverse of the information matrix. Thus, the transformation matrix for use of the proposed Gram–Schmidt–Fisher scoring algorithm can be obtained as described from Equation 4.4. Specifically, the transformation matrix, a lower triangular matrix with ones along the diagonal, is
, where
. and where
and
are entries of the variance–covariance matrix ,
and the information matrix,
respectively.
The parameter estimates can subsequently be obtained by the Gram–Schmidt–Fisher scoring algorithm given by
where
5.1. Application to oesophageal cancer in Chinese families
This application involves the study of oesophageal cancer in 2951 nuclear families collected in Yangcheng County, Shanxi Province in China (Kwagyan, kwagyan). In this study, we consider as a group the nuclear family unit. The outcome variable is whether an individual is affected with oesophageal cancer or not. The objective of the study is to assess the presence and aggregation of oesophageal cancer in these families. Table summarizes the distribution of number of affected individuals by family size. Of the 2951 families, 1580 (53%) had no affected individuals. The combined total number of individuals from the studied families was 14310, with mean sd age of 48.26
18.17 years. Males comprise 56.4 and 25.4% indicated drinking alcohol.
Table 1. Distribution of family size by number of affecteds
The respondent within a family has correlated outcomes which are influenced in part or wholly by the group variables as well as the variables on the individual respondent. The main objective here is to assess the presence of familial aggregation of oesophageal cancer adjusting for measured risk factors: sex, age and alcohol consumption. Here, the model for the regression analysis of disposition is parametrization as
In this application, the vector of parameters is the set of parameters of interest. Computations are performed using the computer program we developed CORRDAT (Bonney, Kwagyan, Slater, & Slifker, bonney5), which was linked with the likelihood optimization software MULTIMAX (Bonney, Kwagyan, & Apprey, bonney4). Computations can also be accomplished in MATLAB. Table gives estimates of the correlation matrices. The correlations between the original parameters are quite high compared to those of the orthogonal parameters. In particular, the correlation between
and
is
; the correlation between
and
is
; and that between
and
is high,
. The correlations between the orthogonal parameters are near zero or nonexistent. The correlation between
is and
is
; the correlation between
and
is now low, 0.004 and that between
and
is also low, 0.0188.
As expected, the orthogonal likelihood converged in fewer iterations than the non-orthogonal one—the orthogonal likelihood converged after 17 iterations; the non-orthogonal one converged after 56 iterations. Estimates of the parameters are summarized in Table . The maximum likelihood estimate of the relative dependence parameter, with a corresponding asymptotic 95% confidence interval of (0.529, 0.625) suggests that there is significant familial aggregation of oesophageal cancer in the families sampled. Sex and age have a positive significance, while alcohol was negative. The results suggest that males are at a higher risk of getting oesophageal cancer than females and also that it is more prevalent in older people. The negative effect of alcohol seems to suggest that it has the propensity to lower the risk of oesophageal cancer in the Chinese population studied, perhaps if drank in moderation. In summary, we conclude that oesophageal cancer aggregates in the families sampled.
Table 2. Correlation matrix from original and orthogonal parametrization. Fisher scoring algorithm used for original parametrization and Gram–Schmidt–Fisher scoring algorithm used of orthogonal parametrization
Table 3. Regression analysis of disposition to oesophageal cancer. Fisher scoring algorithm used for original parametrization and Gram–Schmidt–Fisher scoring algorithm for orthogonal parametrization
6. Conclusions
Parameter orthogonalization is used as an aid in computation, approximation, interpretation and elimination or reduction of the effect of nuisance parameters. The resulting algorithm that we have proposed based on the Gram–Schmidt orthogonalization process is computationally feasible. Unlike the approach of Cox–Reid which requires solving a system of differential equations to obtain orthogonality, the proposed method only requires solving a system of linear equations. Our approach has some similarities with the conjugate gradient algorithm, but is distinctly different from it. The conjugate gradient method is an optimization routine, whereas the method we have proposed is principally an approximate orthogonalization technique combined with a maximization algorithm to aid in parameter estimation. Amari (amari1) claims that global orthogonalization is possible only in a special case. Our method allows for global orthogonalization, if the information matrix can be found or estimated. The approach we have proposed is based on exact calculations. The transformation is surjective, that is the original parameters can be sufficiently obtained. Clearly, the process can be cumbersome if a large number of parameters must be accurately estimated because of the inversion of the information matrix at every stage to obtain the covariance matrix and subsequently the transformation matrix. The advantages, however, are that convergence is generally rapid in iteration times, sure and accurate. Block parametrization of parameters would be desirable if interest is in sets of parameters and where the dimension of the parameter space is large. The work we have presented in this article is the first attempt to consider the use of Gram–Schmidt process for estimation of the parameters. It is possible, however, that closer scrutiny, practical considerations, numerical studies for numerical stability and properties, and simulation studies to evaluate and compare convergence times would suggest modifications and refinements to the methods we have discussed.
In conclusion, the algorithm that we have proposed provides an exceedingly convenient method for multi-parameter statistical problems first to reduce numerical difficulties and second improve accuracy in parameter estimation. The method would be efficient for use of function minimization in both linear and non-linear maximum likelihood estimations and particularly useful for small parameter space estimation problems.
Additional information
Funding
Notes on contributors
John Kwagyan
Dr John Kwagyan is a mathematician and a public health and medical researcher. His research interests include mathematical and statistical modelling of correlated data, clinical trials, survival models, big data analytics, statistical genetics and pharmacokinetic & pharmacodynamics modeling. He is currently the director of Biostatistics, Epidemiology and Research Design, of the Georgetown-Howard University Center for Clinical and Translational Science, Howard University College of Medicine and adjunct professor in the Department of Mathematics, Howard University, Washington, DC, USA.
References
- Amari, S. I. (1982). Differential geometry of curved exponential families- curvatures and information loss. Annals of Statistics, 10, 357–85.
- Amari, S. I. (1985). Differential geometry in statistics. New York, NY: Springer Verlag.
- Bonney, G. E., & Kissling, G. E. (1986). Gram--Schmidt orthogonalization of multinormal variates: Applications in genetics. Biometrical Journal, 28, 417–425.
- Bonney, G. E. (1995). Some new results on regressive models in family studies. Proceedings of the Biometrics Section, American Statistical Association, 177–182.
- Bonney, G. E. (1998a). Regression analysis of disposition to correlated binary outcomes (Scientific Report). Philadelphia, PA: Fox Chase Cancer Center.
- Bonney, G. E. (1998b). Regression analysis of disposition to correlated binary outcomes. Unpublished Manuscript.
- Bonney, G. E., Kwagyan, J., & Apprey, V. (1997). MULTIMAX-A computer package for MULTI-objective MAXimization with applications in genetics and epidemiology. The American Journal of Human Genetics, 61, 447.
- Bonney, G. E., Kwagyan, J., Slater, E., & Slifker, M. (1997). CORRDAT-A computer package for CORRelated DATa. The American Journal of Human Genetics, 61, A194.
- Cox, D. R., & Reid, N. (1987). Parameter orthogonality and approximate conditional inference. Journal of the Royal Statistical Society: Series B, 49, 1–39.
- Cox, D. R., & Reid, N. (1989). On the stability of maximum likelihood estimators of orthogonal parameters. Canada Journal of Statistics, 17, 229–233.
- Clayton, D. G. (1971). Gram-Schmidt orthogonalization. Journal of the Royal Statistical Society: Series C (Applied Statistics), 20, 335–338.
- Godambe, V. P. (1991). Orthogonality of estimating functions and nuisance parameters. Biometrika, 78, 143–151.
- Godambe, V. P., & Thompson, M. E. (1989). An extension of quasi-likelihood estimation (with discussion). Journal of Statistical Planning and Inference, 22, 137–72.
- Hills, S. E. (1987). Parameter orthogonality and approximate conditional inference [Discussion]. Journal of the Royal Statistical Society: Series B, 49, 1–39.
- Hurlimann, W. (1992). On parameter orthogonality of the mean. Statistical Papers, 33, 69–74.
- Kalbliesch, J. D., & Sprott, D. A. (1970). Application of likelihood methods to models involving large number of parameters. Journal of the Royal Statistical Society. Series B (Methodological), 32, 175–208.
- Kwagyan, J. (2001). Further Investigation of the disposition Model for correlated binary outcomes ( PhD Thesis). Philadelphia, PA: Department of Statistics, Temple University.
- Kwagyan, J., Apprey, V., & Bonney, G. E. (2001). Parameter orthogonalization in genetic analysis. Genetic Epidemiology, 21, 163 IGES 059.
- Liang, K. Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.
- Prentice, R. L. (1988). Correlated binary regression with covariates specific to each binary observation. Biometrics, 44, 1088–1048.
- Ross, G. J. S. (1970). The efficient use of function minimization in non-linear maximum-likelihood estimation. Applied Statistics, 19, 205–221.
- Willmot, G. E. (1988). Parameter orthogonality for a family of discrete distributions. The Journal of the Acoustical Society of America, 83, 517–521.