Research Article

Gram–Schmidt–Fisher scoring algorithm for parameter orthogonalization in MLE

Article: 1159847 | Received 19 Sep 2015, Accepted 17 Feb 2016, Published online: 25 Mar 2016

Abstract

The estimation of parameters is a key component of statistical modelling and inference. However, the parametrization of certain likelihood functions can lead to highly correlated estimates, causing numerical problems, mathematical complexities, difficulty in estimation, and erroneous interpretation and, subsequently, inference. In statistical estimation, the concept of orthogonalization is familiar as a simplifying technique that allows parameters to be estimated independently, thus separating the information they carry from one another. We introduce a Fisher scoring iterative process that incorporates the Gram–Schmidt orthogonalization technique for maximum likelihood estimation. A finite mixture model for correlated binary data is used to illustrate the implementation of the method, with a discussion of an application to oesophageal cancer data.

Public Interest Statement

In statistical estimation, the concept of orthogonalization is simply a technique that allows parameters to be estimated independently, through the rotation of the parameter space. The paper introduces an iterative algorithm that utilizes the Fisher scoring process and incorporates the Gram–Schmidt orthogonalization technique for maximum likelihood estimation. The algorithm that we propose provides an exceedingly convenient method for multi-parameter statistical problems first to reduce numerical difficulties and second to improve accuracy in parameter estimation.

1. Introduction

The problem of estimating parameters is one of the key stages in fitting a statistical model to a set of data. Typically, the model will contain some parameters which are not of interest in themselves, but whose values affect the inferences that we make about the parameters of direct interest. The parametrization used when working with certain distributions can lead to high correlations among the maximum likelihood estimates. Maximization algorithms converge rapidly if the initial estimates are good, the likelihood function is well approximated by a quadratic in the neighbourhood of the maximum, and the information matrix is well conditioned, which means that the parameter estimates are not strongly intercorrelated. High intercorrelation of parameters causes numerical problems and leads to difficulty in, or erroneous interpretation of, the parameter estimates. In statistical estimation, the concept of orthogonalization is familiar as a simplifying technique that allows parameters to be estimated independently, through rotation of the parameter space. Computationally, all that is required is that the original parameters be computable from the new ones, and vice versa. An invertible transformation is therefore desirable, though not necessary.

Ross (1970) gave a comprehensive discussion of techniques that can be used to reduce correlation in particular problems, among them sequential and nested maximizations. Kalbfleisch and Sprott (1970) discuss methods aimed at eliminating nuisance parameters from the likelihood function so that inference can be made about only the parameters of interest. Amari (1982, 1985) derived, from a theoretical point of view, general criteria for the existence and construction of orthogonal parameters for statistical models using differential geometry. Cox and Reid (1987) gave a more general procedure and covered the advantages in maximum likelihood estimation. Their approach, though popular in theoretical statistics, is computationally impractical in many applied situations. Willmot (1988) discussed orthogonal parametrization, but only for a two-parameter family of discrete distributions; his method does not extend to higher dimensional parametric problems. Hurlimann (1992) discusses the existence of parametrizations orthogonal to the mean, characterized by a partial differential equation involving the mean, the variance and the cumulant generating functions. Godambe (1991) deals with the problem of nuisance parameters within a semi-parametric set-up which includes the class of distributions associated with generalized linear models; his technique uses the optimum orthogonal estimating functions of Godambe and Thompson (1989). Bonney and Kissling (1986) applied Gram–Schmidt orthogonalization (Clayton, 1971) to multinormal variates and presented an application in genetics. Kwagyan, Apprey and Bonney (2001) presented the idea of Gram–Schmidt parameter orthogonalization in genetic analysis. A major consequence of parameter orthogonalization is that the maximum likelihood estimate of one parameter is not affected, or changes only slowly, with misspecification of another. The most directly interpreted consequence is that the maximum likelihood estimates of orthogonal coordinates are asymptotically independent.

The remainder of this paper is organized as follows. In Section 2, the disposition model for clustered binary data proposed by Bonney (1998a, 1998b) and further investigated by Kwagyan (2001) is introduced and used as motivation. In Section 3, an overview of maximum likelihood estimation and parameter orthogonalization is presented. Section 4 develops the proposed Gram–Schmidt parameter orthogonalization and the iterative scheme for maximum likelihood estimation. In Section 5, the fitting algorithm for the disposition model is discussed, with an application to oesophageal cancer data in Chinese families. Finally, Section 6 wraps up with a discussion of the methodology and outlines directions for further research.

2. Model for correlated binary data

Consider a sample of $N$ groups, each of size $n_i$, $i=1,\ldots,N$, and let $Y_i=(Y_{i1},\ldots,Y_{in_i})^T$ denote the vector of binary outcomes for the $i$th group. It is postulated that the outcomes within a group are possibly correlated. Suppose further that the $j$th subject in the $i$th group has $1\times p$ individual-specific covariates $X_{ij}=(x_{ij1},\ldots,x_{ijp})$, and let the $i$th group have group-specific covariates $Z_i=(z_{i1},\ldots,z_{iq})$. Let $\delta_{ij}$ denote the conditional probability that $Y_{ij}=1$ given $Y_{ij'}=1$; that is,

\delta_{ij}=P(Y_{ij}=1\mid Y_{ij'}=1),\quad j\neq j',\ \ j,j'=1,2,\ldots,n_i.

We call $\delta_{ij}$ the individual disposition, which is interpreted simply as the probability of the outcome on one unit given that another unit from the same cluster has the attribute. This, in essence, is an indication of aggregation. Assume further that, within a group, a pair of observations satisfies the relation

\frac{P(Y_{ij}=1,\,Y_{ij'}=1)}{P(Y_{ij}=1)\,P(Y_{ij'}=1)}=\frac{1}{\alpha_i},\quad \alpha_i>0,\ \ j\neq j',\ \ j,j'=1,2,\ldots,n_i,

where $\alpha_i$ is common to all pairs. Clearly, $\alpha_i=1$ implies independence of the observations. Thus, $\alpha_i$ is a measure of the departure from independence and is called the relative disposition. With the above definitions, Bonney (1998a, 1998b), with further investigation through a latent mixture formulation by Kwagyan (2001), has shown that the joint distribution for the $N$ groups can be based on

(1)   L(\Theta)=\prod_{i=1}^{N}\left[(1-\alpha_i)\prod_{j=1}^{n_i}(1-y_{ij})+\alpha_i\prod_{j=1}^{n_i}\delta_{ij}^{\,y_{ij}}(1-\delta_{ij})^{1-y_{ij}}\right]

The $\alpha_i$ and $\delta_{ij}$ are generally modelled as

\alpha_i=\frac{1+e^{-[M(Z_i)+D(Z_i)]}}{1+e^{-M(Z_i)}},\qquad \delta_{ij}=\frac{1}{1+e^{-[M(Z_i)+D(Z_i)+W(X_{ij})]}}

where $M(Z_i)$ is a function of mean effects, $D(Z_i)$ is a measure of within-group dependence and $W(X_{ij})$ is a function describing the effect of the individual covariates. Typically, $M(Z_i)$, $D(Z_i)$ and $W(X_{ij})$ are modelled, respectively, as

M(Z_i)=\lambda_0+\lambda_1 z_{i1}+\cdots+\lambda_q z_{iq}
D(Z_i)=\gamma_0+\gamma_1 z_{i1}+\cdots+\gamma_q z_{iq}
W(X_{ij})=\beta_1 x_{ij1}+\beta_2 x_{ij2}+\cdots+\beta_p x_{ijp}

where

\Theta=\{\Lambda,\Gamma,\beta\}

and

\Lambda=\{\lambda_0,\ldots,\lambda_q\},\qquad \Gamma=\{\gamma_0,\ldots,\gamma_q\},\qquad \beta=\{\beta_1,\ldots,\beta_p\}

are the unknown parameters to be estimated. When the data consist of a few large clusters, or of truncated or size-biased samples, Bonney (1995) has shown that there is little or no information with which to separate the effects of $M(Z)$ and $D(Z)$; the estimates of the parameters tend to be highly correlated. Similar problems occur in other approaches as well. The popular estimating equations approach, GEE (Liang & Zeger, 1986), suffers the same problems when there are only a few large clusters (see the discussion of Prentice, 1988).

3. Maximum likelihood estimation & parameter orthogonalization

There are different statistical methods for estimating parameters, but the approach most commonly used is maximum likelihood estimation. Maximum likelihood estimates of parameters are those values of the parameters that maximize the likelihood function. Let $y=(y_1,y_2,\ldots,y_n)$ be observed data from a population with a probability distribution $P(y;\theta)$ indexed by the vector of unknown parameters $\theta=(\theta_1,\ldots,\theta_p)$. The likelihood function for $\theta$ is the joint distribution

L(\theta;y)=\prod_{i=1}^{n}P(y_i;\theta)

According to the likelihood principle, $\hat{\theta}$ is regarded as the value of $\theta$ which is most strongly supported by the observed data. In practice, $\hat{\theta}$ is obtained by direct maximization of $\log L(\theta)$ or by solving the set of equations from the score function $U(\theta)$, where

U(\theta)=\frac{\partial \log L}{\partial\theta}=0

for which the Hessian matrix $H(\theta)=\partial^{2}\log L/\partial\theta\,\partial\theta^{T}$ is negative definite. The matrix $I(\theta)=E[-H(\theta)]=E(-\partial^{2}\log L/\partial\theta\,\partial\theta^{T})$ is called the Fisher (expected) information matrix for $\theta$, and its inverse, $\Omega(\theta)=[I(\theta)]^{-1}$, gives the asymptotic variances and covariances of the maximum likelihood estimates.

Suppose $\theta$ is a $p$-dimensional parameter vector partitioned into two subvectors $\theta_1$ and $\theta_2$ of lengths $p_1$ and $p_2$, respectively, $p_1+p_2=p$. Cox and Reid (1987) define $\theta_1$ to be orthogonal to $\theta_2$ if the elements of the information matrix satisfy

E\left(\frac{\partial l}{\partial\theta_s}\,\frac{\partial l}{\partial\theta_t}\right)=E\left(-\frac{\partial^{2}l}{\partial\theta_s\,\partial\theta_t}\right)=0,\quad s=1,\ldots,p_1;\ t=p_1+1,\ldots,p

If this holds for all $\theta$ in the parameter space, it is sometimes called global orthogonality. If it holds at only one parameter value, say $\theta_0$, then $\theta_1$ and $\theta_2$ are said to be locally orthogonal at $\theta_0$.

We now discuss the Cox and Reid (1987) approach for the construction of orthogonal parameters. Suppose that initially the likelihood is specified in terms of $(\theta,\gamma)$, $\gamma=(\gamma_1,\ldots,\gamma_p)$, and let $l(\theta,\gamma)$ be the log-likelihood function. Assume $\theta$ is the parameter of interest and $\gamma$ the set of nuisance parameters. We seek a transformation from $(\theta,\gamma)$ to a new set of orthogonal parameters $(\theta,\lambda)$, $\lambda=(\lambda_1,\ldots,\lambda_p)$. It is easiest to think of the original parameters $\gamma$ as some function of the new parameters $\lambda$, that is,

\gamma=\gamma(\theta,\lambda)

Then, the log-likelihood function can be expressed as

l^{*}(\theta,\lambda)=l\big(\theta,\gamma_1(\theta,\lambda),\ldots,\gamma_p(\theta,\lambda)\big),

where $l^{*}$ now refers to the log-likelihood in the new parametrization. Taking derivatives of this equation with respect to the new parameters, we have, by the chain rule,

\frac{\partial l^{*}}{\partial\theta}=\frac{\partial l}{\partial\theta}+\sum_{j=1}^{p}\frac{\partial l}{\partial\gamma_j}\,\frac{\partial\gamma_j}{\partial\theta}

and

(2)   \frac{\partial^{2}l^{*}}{\partial\theta\,\partial\lambda_k}=\sum_{l=1}^{p}\frac{\partial^{2}l}{\partial\theta\,\partial\gamma_l}\,\frac{\partial\gamma_l}{\partial\lambda_k}+\sum_{j=1}^{p}\sum_{l=1}^{p}\frac{\partial^{2}l}{\partial\gamma_j\,\partial\gamma_l}\,\frac{\partial\gamma_l}{\partial\lambda_k}\,\frac{\partial\gamma_j}{\partial\theta}+\sum_{l=1}^{p}\frac{\partial l}{\partial\gamma_l}\,\frac{\partial^{2}\gamma_l}{\partial\theta\,\partial\lambda_k}

The derivatives of $\gamma$ are not functions of the data, and hence are constant with respect to expectation. Therefore, after taking expectations with respect to the distribution of the data indexed by the parameters $(\theta,\gamma)$, the last term in Equation (2) is zero. If the parameters are orthogonal, then, again from Equation (2),

\sum_{l=1}^{p}\frac{\partial\gamma_l}{\partial\lambda_k}\left[E\left(\frac{\partial^{2}l}{\partial\theta\,\partial\gamma_l}\right)+\sum_{j=1}^{p}E\left(\frac{\partial^{2}l}{\partial\gamma_l\,\partial\gamma_j}\right)\frac{\partial\gamma_j}{\partial\theta}\right]=0,

so that the orthogonality equations are

\sum_{l=1}^{p}\frac{\partial\gamma_l}{\partial\lambda_k}\left[E\left(\frac{\partial^{2}l}{\partial\theta\,\partial\gamma_l}\right)+\sum_{j=1}^{p}E\left(\frac{\partial^{2}l}{\partial\gamma_l\,\partial\gamma_j}\right)\frac{\partial\gamma_j}{\partial\theta}\right]=0,\quad k=1,\ldots,p.

We require the transformation from $(\theta,\gamma)$ to $(\theta,\lambda)$ to have a non-zero Jacobian; hence,

\sum_{j=1}^{p}E\left(\frac{\partial^{2}l}{\partial\gamma_k\,\partial\gamma_j}\right)\frac{\partial\gamma_j}{\partial\theta}=-E\left(\frac{\partial^{2}l}{\partial\theta\,\partial\gamma_k}\right),\quad k=1,\ldots,p

This is a system of $p$ differential equations which must be solved for $\gamma(\theta,\lambda)$. In fact, since $\lambda$ does not enter explicitly into the equations, the solution for $\gamma_j$ can contain an arbitrary function of $\lambda$ as the integrating constant. It is noted in the discussion of Cox and Reid that, although it is sometimes theoretically possible to find a differential equation, simple explicit solutions of the differential equation were not feasible for some models. There are also cases where explicit solutions are possible, but the original nuisance parameters cannot be explicitly expressed in terms of the orthogonal parameters (Hills, 1987). In general, global orthogonality also cannot be achieved by this approach.

4. Gram–Schmidt parameter orthogonalization

The Gram–Schmidt orthogonalization process (Clayton, 1971) is equivalent to the linear transformation from $\Theta=(\theta_1,\ldots,\theta_p)$ to $\Theta^{*}=(\theta_1^{*},\ldots,\theta_p^{*})$ defined as:

(3)   \theta_1^{*}=\theta_1
      \theta_2^{*}=\theta_2-b_{21}\theta_1
      \qquad\vdots
      \theta_j^{*}=\theta_j-\sum_{k=1}^{j-1}b_{j(j-k)}\theta_{j-k},\quad j=2,\ldots,p

In a linear regression set-up, $\theta_j^{*}$ is $\theta_j$ adjusted for $\theta_1,\ldots,\theta_{j-1}$, and $b_{jk}$, $j=2,\ldots,p$; $k=1,\ldots,(j-1)$, are the multiple regression coefficients. Writing this in matrix notation, we have

\Theta^{*}=B\Theta

where $B$, the transformation matrix, is lower triangular with ones along the diagonal. The transformation matrix is chosen so that $\theta_1^{*},\theta_2^{*},\ldots,\theta_p^{*}$ are mutually uncorrelated. The Jacobian of the transformation is unity. Suppose $\Omega$ is the covariance matrix of $\Theta$; then the covariance matrix of $\Theta^{*}$, $\Omega^{*}=B\Omega B^{T}$, is a diagonal matrix. We recommend that the parameters be ordered in terms of interest. This ensures that the parameters of most interest are least affected by round-off errors.

4.1. Evaluation of the transformation matrix

To evaluate the transformation matrix $B$, let $\Omega=[I(\theta)]^{-1}=(c_{ij})$, $i,j=1,\ldots,p$, where $c_{ij}=\mathrm{cov}(\theta_i,\theta_j)$ are the elements of the covariance matrix, the inverse of the information matrix. The orthogonality relation $\mathrm{cov}(\theta_i^{*},\theta_j^{*})=0$ $(i\neq j)$ implies that, for $i<j$,

0=\mathrm{cov}(\theta_i^{*},\theta_j^{*})=\mathrm{cov}\left(\theta_i^{*},\ \theta_j-\sum_{k=1}^{j-1}b_{j(j-k)}\theta_{j-k}\right)

Since, by Equation (3), each $\theta_i^{*}$ is a linear combination of $\theta_1,\ldots,\theta_i$ with unit coefficient on $\theta_i$, requiring $\theta_j^{*}$ to be uncorrelated with $\theta_1^{*},\ldots,\theta_{j-1}^{*}$ is equivalent to requiring $\theta_j^{*}$ to be uncorrelated with each of the original $\theta_1,\ldots,\theta_{j-1}$:

0=\mathrm{cov}\left(\theta_i,\ \theta_j-\sum_{k=1}^{j-1}b_{j(j-k)}\theta_{j-k}\right)=c_{ij}-\sum_{k=1}^{j-1}b_{j(j-k)}c_{i(j-k)},\quad i=1,\ldots,j-1.

Thus, the system of linear equations to be solved for the entries of the transformation matrix, based on the covariance matrix, is:

(4)   c_{ij}=\sum_{k=1}^{j-1}b_{j(j-k)}\,c_{i(j-k)},\quad i=1,\ldots,(j-1);\ j=2,\ldots,p.

Consequently, solving the system of linear equations (4), we obtain the elements of the transformation matrix as

(5)   b_{jk}=\sum_{r=1}^{j-1}c_{rj}\,I_{r(j-k)},\quad j=2,\ldots,p;\ k=1,\ldots,(j-1),

where $I_{r(j-k)}$ are entries of the information matrix. The observed information matrix would be a good approximation to the expected information if the latter is difficult to evaluate.

For illustration, when $p=4$, the Gram–Schmidt orthogonalization process in matrix notation is

\begin{pmatrix}\theta_1^{*}\\ \theta_2^{*}\\ \theta_3^{*}\\ \theta_4^{*}\end{pmatrix}=\begin{pmatrix}1&0&0&0\\ -b_{21}&1&0&0\\ -b_{31}&-b_{32}&1&0\\ -b_{41}&-b_{42}&-b_{43}&1\end{pmatrix}\begin{pmatrix}\theta_1\\ \theta_2\\ \theta_3\\ \theta_4\end{pmatrix}

The orthogonality of the parameters $(\theta_1^{*},\theta_2^{*},\theta_3^{*},\theta_4^{*})$ implies

0=c_{12}-b_{21}c_{11}
0=c_{13}-b_{32}c_{12}-b_{31}c_{11}
0=c_{23}-b_{32}c_{22}-b_{31}c_{12}
0=c_{14}-b_{43}c_{13}-b_{42}c_{12}-b_{41}c_{11}
0=c_{24}-b_{43}c_{23}-b_{42}c_{22}-b_{41}c_{12}
0=c_{34}-b_{43}c_{33}-b_{42}c_{23}-b_{41}c_{13}

And so the system of linear equations to be solved for the elements of the transformation matrix is:

\begin{pmatrix}c_{11}&0&0&0&0&0\\ 0&c_{11}&c_{12}&0&0&0\\ 0&c_{12}&c_{22}&0&0&0\\ 0&0&0&c_{11}&c_{12}&c_{13}\\ 0&0&0&c_{12}&c_{22}&c_{23}\\ 0&0&0&c_{13}&c_{23}&c_{33}\end{pmatrix}\begin{pmatrix}b_{21}\\ b_{31}\\ b_{32}\\ b_{41}\\ b_{42}\\ b_{43}\end{pmatrix}=\begin{pmatrix}c_{12}\\ c_{13}\\ c_{23}\\ c_{14}\\ c_{24}\\ c_{34}\end{pmatrix}

Let $Q$ be the matrix of coefficients of the system of equations above; then we note that $Q$ is a patterned block diagonal matrix. Furthermore, let $D_r$ $(r=1,2,3)$ be the diagonal blocks of $Q$; then, in this illustration,

Q=\begin{pmatrix}D_1&0&0\\ 0&D_2&0\\ 0&0&D_3\end{pmatrix}

where

D_1=(c_{11}),\qquad D_2=\begin{pmatrix}c_{11}&c_{12}\\ c_{12}&c_{22}\end{pmatrix},\qquad D_3=\begin{pmatrix}c_{11}&c_{12}&c_{13}\\ c_{12}&c_{22}&c_{23}\\ c_{13}&c_{23}&c_{33}\end{pmatrix}

Thus, $D_r$ is the covariance matrix of the first $r$ parameters. It can easily be shown that $D_r$ is symmetric and positive definite, and so $Q$ is non-singular. A unique solution therefore exists for the illustrative system above and, in general, for Equation (4). The elements of the transformation matrix $B$ can thus be obtained blockwise as

(b_{j1},\ldots,b_{j(j-1)})^{T}=D_{j-1}^{-1}\,(c_{1j},\ldots,c_{(j-1)j})^{T},\quad j=2,\ldots,p,

or, elementwise, as in Equation (5),

b_{jk}=\sum_{r=1}^{j-1}c_{rj}\,I_{r(j-k)},

where $I_{ij}$ are the entries of the information matrix.

Having obtained $\Theta^{*}=(\theta_1^{*},\ldots,\theta_p^{*})$, the original parameters $\Theta=(\theta_1,\ldots,\theta_p)$ can be recovered recursively as

\theta_1=\theta_1^{*}
\theta_j=\theta_j^{*}+\sum_{k=1}^{j-1}b_{j(j-k)}\theta_{j-k},\quad j=2,\ldots,p

or, writing this in matrix notation, we have

\Theta=B^{-1}\Theta^{*}
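To make the computation concrete, the following minimal numpy sketch (our illustration, not part of the original development) builds $B$ from a given covariance matrix by solving Equation (4) row by row, checks that $B\Omega B^{T}$ is diagonal, and recovers $\Theta$ from $\Theta^{*}$:

import numpy as np

def gram_schmidt_B(omega):
    """Build the lower triangular transformation matrix B from a covariance
    matrix, solving D_{j-1} b_j = c_{.j} (Equation (4)) for each row j."""
    p = omega.shape[0]
    B = np.eye(p)
    for j in range(1, p):
        b = np.linalg.solve(omega[:j, :j], omega[:j, j])
        B[j, :j] = -b                        # theta*_j = theta_j - sum_k b_jk theta_k
    return B

# Illustrative covariance matrix: any symmetric positive definite matrix works.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
omega = A @ A.T
B = gram_schmidt_B(omega)
print(np.round(B @ omega @ B.T, 10))         # diagonal: the theta*_j are uncorrelated

theta = np.array([1.0, 2.0, 3.0, 4.0])       # hypothetical parameter values
theta_star = B @ theta                       # orthogonalized parameters
print(np.linalg.solve(B, theta_star))        # back-transform: Theta = B^{-1} Theta*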

4.2. Block orthogonalization

Let $\Theta^{T}=\{\Lambda,\Gamma,\beta\}$ be a set of parameters to be estimated, where

\Lambda=(\lambda_0,\lambda_1,\ldots,\lambda_q)^{T};\qquad \Gamma=(\gamma_0,\gamma_1,\ldots,\gamma_q)^{T};\qquad \beta=(\beta_1,\beta_2,\ldots,\beta_p)^{T}

Suppose further that the sets of parameters in $\Theta^{T}=\{\Lambda,\Gamma,\beta\}$ are correlated. Then we wish to find a new set $\Theta^{*T}=\{\Lambda^{*},\Gamma^{*},\beta^{*}\}$, through a linear transformation, such that the vectors of parameters in $\Theta^{*}$ are mutually uncorrelated. We allow for correlation, if any, within each set of parameters in $\Omega^{*}$.

Suppose further that the vector $\beta$ is the set of parameters of interest. Then the Gram–Schmidt orthogonalization process computes the new set of parameters in terms of the original parameters as follows:

\beta^{*}=\beta
\Lambda^{*}=\Lambda-B_{21}\beta
\Gamma^{*}=\Gamma-B_{31}\beta-B_{32}\Lambda

where $B_{21}$, $B_{31}$ and $B_{32}$ are matrices of regression coefficients of conformable dimensions: with $\beta$ of length $p$ and $\Lambda$ and $\Gamma$ each of length $q+1$, $B_{21}$ and $B_{31}$ are $(q+1)\times p$ and $B_{32}$ is $(q+1)\times(q+1)$.

In matrix notation, we write

\begin{pmatrix}\beta^{*}\\ \Lambda^{*}\\ \Gamma^{*}\end{pmatrix}=\begin{pmatrix}I&0&0\\ -B_{21}&I&0\\ -B_{31}&-B_{32}&I\end{pmatrix}\begin{pmatrix}\beta\\ \Lambda\\ \Gamma\end{pmatrix}

or

\Theta^{*}=B\Theta

where $B$ is a lower triangular block matrix whose diagonal blocks are identity matrices. The transformation matrix is chosen such that $\beta^{*}$, $\Lambda^{*}$ and $\Gamma^{*}$ are mutually uncorrelated, so that the covariance matrix of $\Theta^{*}$, $\Omega^{*}=B\Omega B^{T}$, is block diagonal. The Jacobian of the transformation is unity. The only way in which this procedure could break down would be if one of the vectors of $\Theta^{*}$ were identically zero. From the method of formation of $\Theta^{*}$, each of its vectors is a linear combination of the vectors in $\Theta$; if one were identically zero, the vectors in $\Theta$ would be linearly dependent, contradicting the assumption of independence. Extensions to more than three sets of vectors of parameters readily follow.
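A block analogue of the computation in Section 4.1 can be sketched as follows (our illustration; the block sizes are arbitrary). Each block is regressed jointly on all preceding blocks, which yields $B_{21}$, $B_{31}$ and $B_{32}$ at once and makes $B\Omega B^{T}$ block diagonal:

import numpy as np

def block_gs(C, sizes):
    """Block Gram-Schmidt: return a block lower-triangular B with identity
    diagonal blocks such that B C B^T is block diagonal. `sizes` lists the
    block lengths in decreasing order of interest."""
    idx = np.cumsum([0] + list(sizes))
    B = np.eye(idx[-1])
    for j in range(1, len(sizes)):
        rows = slice(idx[j], idx[j + 1])     # current block
        prev = slice(0, idx[j])              # all preceding blocks
        # regression coefficients of the current block on the preceding ones
        B[rows, prev] = -np.linalg.solve(C[prev, prev], C[prev, rows]).T
    return B

# Example: blocks of sizes 2 (beta), 2 (Lambda), 1 (Gamma), illustrative values.
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
C = A @ A.T                                  # a covariance matrix
B = block_gs(C, [2, 2, 1])
print(np.round(B @ C @ B.T, 10))             # off-diagonal blocks vanish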

4.3. Gram–Schmidt–Fisher scoring algorithm

We introduce a modification of the Fisher scoring algorithm that incorporates the Gram–Schmidt process to obtain an information matrix which is diagonal, thus ensuring approximate (near) orthogonality and, consequently, the stability of the estimates of the new parameters. Since the transformation is linear and invertible, the original parameters are readily obtained.

Let $l(\theta;y)$ be the log-likelihood in the original parametrization; then the Fisher scoring algorithm is given by the iterative routine

(6)   \theta^{(m+1)}=\theta^{(m)}+[I(\theta^{(m)})]^{-1}U(\theta^{(m)})

where I(θ) is the expected information matrix and U(θ) is the score function.

Suppose $\theta$ is transformed to $\theta^{*}$ through the Gram–Schmidt orthogonalization process, and let $B=(b_{ij})$ be the matrix of transformation, with entries defined as in Equation (5). Then, since $B$ is square and non-singular, we can write, from Equation (6),

(7)   B\theta^{(m+1)}=B\theta^{(m)}+B[I(\theta^{(m)})]^{-1}\big[B^{T}(B^{T})^{-1}\big]U(\theta^{(m)})
      B\theta^{(m+1)}=B\theta^{(m)}+\big\{B[I(\theta^{(m)})]^{-1}B^{T}\big\}\big[(B^{T})^{-1}U(\theta^{(m)})\big]
      \theta^{*(m+1)}=\theta^{*(m)}+[I^{*}(\theta^{*(m)})]^{-1}U^{*}(\theta^{*(m)})

where

\theta^{*(m)}=B\theta^{(m)}
[I^{*}(\theta^{*(m)})]^{-1}=B[I(\theta^{(m)})]^{-1}B^{T},\ \text{which is asymptotically diagonal}
U^{*}(\theta^{*(m)})=(B^{T})^{-1}U(\theta^{(m)})

We shall call Equation (7) the Gram–Schmidt–Fisher scoring algorithm.

4.3.1. Algorithmic process

The proposed Gram–Schmidt–Fisher scoring algorithm is a two-stage iterative process that oscillates between Equations (5) and (7) until convergence; it proceeds as follows (a code sketch is given after the list):

1. Start with an initial estimate of the original parameter $\theta$, denoted by $\theta^{(0)}$.

2. Estimate $B$ at $\theta^{(0)}$ from Equation (5), that is, $b_{jk}=\sum_{r=1}^{j-1}c_{rj}I_{r(j-k)}$, $j=2,\ldots,p$; $k=1,\ldots,(j-1)$.

3. Determine the initial estimate of the orthogonal parametrization, $\theta^{*(0)}$, from Equation (3), i.e. $\theta_1^{*(0)}=\theta_1^{(0)}$, $\theta_j^{*(0)}=\theta_j^{(0)}-\sum_{k=1}^{j-1}b_{j(j-k)}\theta_{j-k}^{(0)}$, $j=2,\ldots,p$.

4. Update $\theta^{*(0)}$ through Equation (7) to obtain a new value $\theta^{*(1)}=\theta^{*(0)}+[I^{*}(\theta^{*(0)})]^{-1}U^{*}(\theta^{*(0)})$.

5. Update $\theta$ to obtain a new value $\theta^{(1)}$, recursively as $\theta^{(1)}=B^{-1}\theta^{*(1)}$.

6. Repeat steps 2 through 5 using $\theta^{(1)}$ to obtain $\theta^{(2)}$, and so on.

7. Stop when $|\theta^{(n)}-\theta^{(n-1)}|<\varepsilon$.
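The following is a minimal Python/numpy sketch of steps 1 through 7 (the function and variable names are ours; score and info stand for user-supplied routines returning $U(\theta)$ and $I(\theta)$). For numerical convenience, it solves the system in Equation (4) directly for each row of $B$ rather than combining covariance and information entries as in Equation (5):

import numpy as np

def gs_fisher(theta0, score, info, tol=1e-8, max_iter=100):
    """Gram-Schmidt-Fisher scoring: Fisher scoring carried out in the
    orthogonalized coordinates theta* = B theta, Equations (3)-(7)."""
    theta = np.asarray(theta0, dtype=float)
    p = theta.size
    for _ in range(max_iter):
        C = np.linalg.inv(info(theta))                 # covariance matrix at theta
        B = np.eye(p)                                  # step 2: B from Equation (4)
        for j in range(1, p):
            B[j, :j] = -np.linalg.solve(C[:j, :j], C[:j, j])
        theta_star = B @ theta                         # step 3: orthogonal coordinates
        U_star = np.linalg.solve(B.T, score(theta))    # (B^T)^{-1} U(theta)
        omega_star = B @ C @ B.T                       # [I*(theta*)]^{-1}, near diagonal
        theta_star = theta_star + omega_star @ U_star  # step 4: Equation (7)
        theta_new = np.linalg.solve(B, theta_star)     # step 5: back-transform
        if np.max(np.abs(theta_new - theta)) < tol:    # step 7: stopping rule
            return theta_new
        theta = theta_new
    return theta

In exact arithmetic, one pass of this sketch reduces to one Fisher scoring step; the practical gain is that the update is carried out on a near-diagonal, better-conditioned system.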

4.4. Example

We consider a sample problem which concerns inference about the difference between two exponential means. Let $Y_1$ and $Y_2$ be independent exponential random variables with means $\phi$ and $(\phi+\psi)$, respectively, so that $\psi$, the difference between the means, is the parameter of interest. The joint distribution is given by

f(y_1,y_2\mid\phi,\psi)=\frac{1}{\phi(\phi+\psi)}\exp\left[-\left(\frac{y_1}{\phi}+\frac{y_2}{\phi+\psi}\right)\right]

For a sample of $n$ independent pairs, the score vector is

U(\phi,\psi)=\begin{pmatrix}\partial l/\partial\phi\\ \partial l/\partial\psi\end{pmatrix}=\begin{pmatrix}-\dfrac{n}{\phi}-\dfrac{n}{\phi+\psi}+\dfrac{\sum_{i=1}^{n}y_{1i}}{\phi^{2}}+\dfrac{\sum_{i=1}^{n}y_{2i}}{(\phi+\psi)^{2}}\\[8pt] -\dfrac{n}{\phi+\psi}+\dfrac{\sum_{i=1}^{n}y_{2i}}{(\phi+\psi)^{2}}\end{pmatrix}

Ordering the parameters as $(\psi,\phi)$, with the parameter of interest first, the information matrix is

I(\psi,\phi)=\begin{pmatrix}i_{\psi\psi}&i_{\psi\phi}\\ i_{\psi\phi}&i_{\phi\phi}\end{pmatrix}=\begin{pmatrix}\dfrac{n}{(\phi+\psi)^{2}}&\dfrac{n}{(\phi+\psi)^{2}}\\[8pt] \dfrac{n}{(\phi+\psi)^{2}}&\dfrac{n}{\phi^{2}}+\dfrac{n}{(\phi+\psi)^{2}}\end{pmatrix}

and its inverse, the variance–covariance matrix, is

[I(\psi,\phi)]^{-1}=\begin{pmatrix}c_{\psi\psi}&c_{\psi\phi}\\ c_{\psi\phi}&c_{\phi\phi}\end{pmatrix}=\frac{1}{n}\begin{pmatrix}2\phi^{2}+2\phi\psi+\psi^{2}&-\phi^{2}\\ -\phi^{2}&\phi^{2}\end{pmatrix}

4.4.1. Cox and Reid approach

Orthogonal parametrization following the Cox–Reid method, with $\psi$ the parameter of interest and $\phi$ the nuisance parameter, requires solving the differential equation

\left[\frac{1}{(\phi+\psi)^{2}}+\frac{1}{\phi^{2}}\right]\frac{\partial\phi}{\partial\psi}=-\frac{1}{(\phi+\psi)^{2}}

This can be solved by separation of variables, leading to $a(\lambda)=\phi(\psi+\phi)/(\psi+2\phi)$, where $a(\lambda)$ is an arbitrary function of $\lambda$. Cox and Reid suggest $a(\lambda)=e^{\lambda}$ as a suitable choice. Clearly, this does not produce a unique solution for $\phi$, regardless of the parametrization of $a(\lambda)$. Thus, different reparametrizations may lead to different modified likelihoods, so that a Cox–Reid estimator may not exist, or there may be many of them.
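The solution can be checked symbolically. The short sympy sketch below (our check, not from the original text) verifies that the level curves of $\phi(\psi+\phi)/(\psi+2\phi)$ satisfy the differential equation above:

import sympy as sp

phi, psi = sp.symbols('phi psi', positive=True)
F = phi*(psi + phi)/(psi + 2*phi)              # candidate solution a(lambda)
dphi_dpsi = -sp.diff(F, psi)/sp.diff(F, phi)   # implicit slope along F = constant
rhs = -(1/(phi + psi)**2)/(1/phi**2 + 1/(phi + psi)**2)
print(sp.simplify(dphi_dpsi - rhs))            # prints 0: the ODE is satisfied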

4.4.2. Gram–Schmidt approach

The proposed Gram–Schmidt approach requires seeking a transformation from $(\psi,\phi)$ to $(\psi,\lambda(\psi,\phi))$ such that $\lambda=\phi-b_{21}\psi$, and solving the equation

c_{\psi\phi}=b_{21}\,c_{\psi\psi}

From the variance–covariance matrix, we obtain

b_{21}(\phi,\psi)=-\frac{\phi^{2}}{2\phi^{2}+2\phi\psi+\psi^{2}},\qquad \lambda=\phi+\frac{\phi^{2}}{2\phi^{2}+2\phi\psi+\psi^{2}}\,\psi

Just like the Cox–Reid approach, it is impractical to express $\phi$ uniquely in terms of $\psi$ and $\lambda$ and subsequently formulate a modified, approximately orthogonal joint distribution function. Therefore, one can proceed iteratively, using the proposed Gram–Schmidt–Fisher scoring algorithm.
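To illustrate numerically (our illustration; the parameter values are arbitrary), the following sketch forms $b_{21}$ from the variance–covariance matrix and confirms that the rotated parameters $(\psi,\lambda)$ are orthogonal:

import numpy as np

n, phi, psi = 100, 2.0, 1.5
s = phi + psi
I = np.array([[n/s**2, n/s**2],
              [n/s**2, n/phi**2 + n/s**2]])    # information in (psi, phi) order
C = np.linalg.inv(I)                           # variance-covariance matrix
b21 = C[0, 1]/C[0, 0]                          # = -phi^2/(2 phi^2 + 2 phi psi + psi^2)
B = np.array([[1.0, 0.0], [-b21, 1.0]])        # (psi, lambda) = B (psi, phi)
print(np.round(B @ C @ B.T, 12))               # off-diagonal entries vanish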

5. Fitting the disposition model

We now present procedures for estimating the parameters in the correlated binary data model of Equation (1). The calculations are standard, and so we only outline the results. We write the likelihood contribution of the $i$th group as

L_i(Y_i;\theta)=(1-\alpha_i)\prod_{j=1}^{n_i}(1-y_{ij})+\alpha_i L_{0i}(\theta\mid y_i)

where

L_{0i}(\theta\mid y_i)=\prod_{j=1}^{n_i}\delta_{ij}^{\,y_{ij}}(1-\delta_{ij})^{1-y_{ij}}
\delta_{ij}(\theta)=\frac{1}{1+\exp\{-[M(Z_i)+D(Z_i)+W(X_{ij})]\}}
\alpha_i(\theta)=\frac{1+\exp\{-[M(Z_i)+D(Z_i)]\}}{1+\exp\{-M(Z_i)\}}

Let

\Upsilon_{ij}(\theta)=M(Z_i)+D(Z_i)+W(X_{ij})

and define the following:

\Upsilon_{ij}^{(1)}=\frac{\partial}{\partial\theta}\Upsilon_{ij}(\theta)\qquad\text{and}\qquad \Upsilon_{ij}^{(2)}=\frac{\partial^{2}}{\partial\theta\,\partial\theta^{T}}\Upsilon_{ij}(\theta)

The contribution of the $i$th group to the log-likelihood is the term

\log L_i(\theta\mid y_i)=\log\left[\big(1-\alpha_i(\theta)\big)\prod_{j=1}^{n_i}(1-y_{ij})+\alpha_i(\theta)\,L_{0i}(\theta\mid y_i)\right]

and the contribution to the score function is the term

U_i=\frac{\partial}{\partial\theta}\log L_i(\theta)=A_i(\theta)\,\dot{\alpha}_i+D_i(\theta)\,U_{0i}(\theta\mid y),

where

\dot{\alpha}_i(\theta)=\frac{\partial}{\partial\theta}\log\alpha_i(\theta)=\delta_{i0}(1-\alpha_i)\frac{\partial}{\partial\theta}M(Z_i)-(1-\delta_{i0})\frac{\partial}{\partial\theta}D(Z_i)
\delta_{i0}=\frac{1}{1+\exp\{-[M(Z_i)+D(Z_i)]\}}
A_i(\theta)=\alpha_i(\theta)\Big[L_{0i}(\theta\mid y)-\prod_{j=1}^{n_i}(1-y_{ij})\Big]\Big/L_i,\qquad D_i(\theta)=\alpha_i L_{0i}/L_i
U_{0i}(\theta\mid y_i)=\frac{\partial}{\partial\theta}\log L_{0i}(\theta\mid y)=\sum_{j=1}^{n_i}(y_{ij}-\delta_{ij})\,\Upsilon_{ij}^{(1)}

Setting $U(\theta)=\sum_{i=1}^{N}U_i=0$, we obtain the score equations. Closed-form solutions are not possible, and so the equations are solved by iterative procedures to obtain the maximum likelihood estimates of the parameters.
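For concreteness, the following minimal Python sketch (our illustration; the covariate values are hypothetical) evaluates the log-likelihood contribution of a single family under Equation (1):

import numpy as np

def loglik_family(y, M, D, W):
    """Log-likelihood contribution of one family under Equation (1).
    y: 0/1 outcomes for the n_i members; M = M(Z_i), D = D(Z_i) scalars;
    W: array of W(X_ij) values, one per member."""
    alpha = (1 + np.exp(-(M + D)))/(1 + np.exp(-M))
    delta = 1.0/(1 + np.exp(-(M + D + W)))
    all_unaffected = np.prod(1 - y)                   # 1 only if every y_ij = 0
    L0 = np.prod(delta**y * (1 - delta)**(1 - y))     # independent Bernoulli part
    return np.log((1 - alpha)*all_unaffected + alpha*L0)

y = np.array([0, 1, 0, 0])               # hypothetical family: one affected member
W = np.array([0.2, -0.1, 0.4, 0.0])      # hypothetical values of beta' X_ij
print(loglik_family(y, M=0.5, D=0.3, W=W))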

The contribution of the $i$th group to the Hessian matrix is the term

H_i(\theta)=\frac{\partial^{2}l_i(\theta)}{\partial\theta\,\partial\theta^{T}}=\sum_{j=1}^{n_i}\Big[(y_{ij}-\delta_{ij})\,\Upsilon_{ij}^{(2)}-\delta_{ij}(1-\delta_{ij})\,\Upsilon_{ij}^{(1)}\Upsilon_{ij}^{(1)T}\Big].

Estimates of the parameter vector can be obtained by the Newton–Raphson iteration routine, which is given by the updating formula

\theta^{(s+1)}=\theta^{(s)}-[H(\theta^{(s)})]^{-1}U(\theta^{(s)})

where $\theta^{(s)}$ is the estimate at the $s$th iteration.

The contribution of the $i$th group to the Fisher information matrix is the term

I_i(\theta)=\alpha_i I_{0i}+A_i\,\dot{\alpha}_i\dot{\alpha}_i^{T}+D_i\big[\dot{\alpha}_i U_{0i}^{T}+U_{0i}\dot{\alpha}_i^{T}\big]-D_i(1-\alpha_i)U_{0i}U_{0i}^{T}

where

I_{0i}=\sum_{j=1}^{n_i}\delta_{ij}(1-\delta_{ij})\,\Upsilon_{ij}^{(1)}\Upsilon_{ij}^{(1)T}

and where $A_i$, $D_i$ and $U_{0i}$ are evaluated at $y=(y_1,y_2,\ldots,y_n)=0$. The estimates can alternatively be obtained by the Fisher scoring method, which is given by

\theta^{(s+1)}=\theta^{(s)}+[I(\theta^{(s)})]^{-1}U(\theta^{(s)})

The asymptotic variance–covariance matrix of the parameter estimates, $C(\theta)=[I(\theta)]^{-1}$, is the inverse of the information matrix. Thus, the transformation matrix for use in the proposed Gram–Schmidt–Fisher scoring algorithm can be obtained as described in Section 4.1. Specifically, the transformation matrix, a lower triangular matrix with ones along the diagonal, is $B=(b_{jk})$ with, as in Equation (5), $b_{jk}=\sum_{r=1}^{j-1}c_{rj}I_{r(j-k)}$, $j=2,\ldots,p$; $k=1,\ldots,(j-1)$, where $(c_{rj})$ and $(I_{r(j-k)})$ are entries of the variance–covariance matrix $C(\theta)$ and the information matrix $I(\theta)$, respectively.

The parameter estimates can subsequently be obtained by the Gram–Schmidt–Fisher scoring algorithm, given by

\theta^{*(s+1)}=\theta^{*(s)}+[I^{*}(\theta^{*(s)})]^{-1}U^{*}(\theta^{*(s)})

where

\theta^{*(s)}=B\theta^{(s)};\qquad U^{*}(\theta^{*(s)})=(B^{T})^{-1}U(\theta^{(s)});\qquad [I^{*}(\theta^{*(s)})]^{-1}=B[I(\theta^{(s)})]^{-1}B^{T}

5.1. Application to oesophageal cancer in Chinese families

This application involves the study of oesophageal cancer in 2,951 nuclear families collected in Yangcheng County, Shanxi Province, China (Kwagyan, 2001). In this study, we consider the nuclear family unit as a group. The outcome variable is whether or not an individual is affected with oesophageal cancer. The objective of the study is to assess the presence and aggregation of oesophageal cancer in these families. Table 1 summarizes the distribution of the number of affected individuals by family size. Of the 2,951 families, 1,580 (53%) had no affected individuals. The combined total number of individuals from the studied families was 14,310, with a mean ± SD age of 48.26 ± 18.17 years. Males comprised 56.4% of the sample, and 25.4% of individuals indicated drinking alcohol.

Table 1. Distribution of family size by number of affected individuals

Respondents within a family have correlated outcomes, which are influenced in part or wholly by the group variables as well as by the variables on the individual respondents. The main objective here is to assess the presence of familial aggregation of oesophageal cancer, adjusting for the measured risk factors: sex, age and alcohol consumption. Here, the model for the regression analysis of disposition is parametrized as

M(Z)=\lambda,\qquad D(Z)=\gamma\qquad\text{and}\qquad W(X)=\beta_1\,\mathrm{sex}+\beta_2\,\mathrm{alcohol}+\beta_3\,\mathrm{age}

In this application, the vector of parameters $\beta=(\beta_1,\beta_2,\beta_3)$ is the set of parameters of interest. Computations were performed using CORRDAT, a computer program we developed (Bonney, Kwagyan, Slater, & Slifker, 1997), linked with the likelihood optimization software MULTIMAX (Bonney, Kwagyan, & Apprey, 1997); they can also be accomplished in MATLAB. Table 2 gives estimates of the correlation matrices. The correlations between the original parameters are quite high compared with those of the orthogonal parameters. In particular, the correlation between $\lambda$ and $\gamma$ is $-0.179$; the correlation between $\lambda$ and $\beta_1$ is $-0.414$; and that between $\lambda$ and $\beta_3$ is high, $-0.841$. The correlations between the orthogonal parameters are near zero or negligible: the correlation between $\lambda^{*}$ and $\gamma^{*}$ is $-0.026$; the correlation between $\lambda^{*}$ and $\beta_1^{*}$ is now low, $0.004$; and that between $\lambda^{*}$ and $\beta_3^{*}$ is also low, $0.0188$.

As expected, the orthogonal likelihood converged in fewer iterations than the non-orthogonal one: the orthogonal likelihood converged after 17 iterations, the non-orthogonal one after 56. Estimates of the parameters are summarized in Table 3. The maximum likelihood estimate of the relative dependence parameter, $\hat{\gamma}=0.577$, with a corresponding asymptotic 95% confidence interval of (0.529, 0.625), suggests that there is significant familial aggregation of oesophageal cancer in the families sampled. Sex and age had significant positive effects, while the effect of alcohol was negative. The results suggest that males are at a higher risk of oesophageal cancer than females, and that the disease is more prevalent in older people. The negative effect of alcohol seems to suggest that it has the propensity to lower the risk of oesophageal cancer in the Chinese population studied, perhaps if drunk in moderation. In summary, we conclude that oesophageal cancer aggregates in the families sampled.

Table 2. Correlation matrices from the original and orthogonal parametrizations. Fisher scoring algorithm used for the original parametrization and Gram–Schmidt–Fisher scoring algorithm used for the orthogonal parametrization

Table 3. Regression analysis of disposition to oesophageal cancer. Fisher scoring algorithm used for original parametrization and Gram–Schmidt–Fisher scoring algorithm for orthogonal parametrization

6. Conclusions

Parameter orthogonalization is used as an aid in computation, approximation, interpretation, and the elimination or reduction of the effect of nuisance parameters. The algorithm that we have proposed, based on the Gram–Schmidt orthogonalization process, is computationally feasible. Unlike the approach of Cox and Reid, which requires solving a system of differential equations to obtain orthogonality, the proposed method only requires solving a system of linear equations. Our approach has some similarities with the conjugate gradient algorithm, but is distinctly different from it: the conjugate gradient method is an optimization routine, whereas the method we have proposed is principally an approximate orthogonalization technique combined with a maximization algorithm to aid in parameter estimation. Amari (1985) claims that global orthogonalization is possible only in special cases. Our method allows for global orthogonalization if the information matrix can be found or estimated. The approach we have proposed is based on exact calculations, and the transformation is invertible, so the original parameters can readily be recovered. Clearly, the process can be cumbersome if a large number of parameters must be accurately estimated, because of the inversion of the information matrix at every stage to obtain the covariance matrix and, subsequently, the transformation matrix. The advantages, however, are that convergence is generally rapid, reliable and accurate. Block parametrization would be desirable when interest is in sets of parameters and the dimension of the parameter space is large. The work presented in this article is a first attempt to use the Gram–Schmidt process for the estimation of parameters. It is possible, however, that closer scrutiny, practical considerations, numerical studies of stability and properties, and simulation studies to evaluate and compare convergence times would suggest modifications and refinements to the methods we have discussed.

In conclusion, the algorithm that we have proposed provides an exceedingly convenient method for multi-parameter statistical problems, first to reduce numerical difficulties and second to improve accuracy in parameter estimation. The method should be efficient for function minimization in both linear and non-linear maximum likelihood estimation, and is particularly useful for estimation problems with a small parameter space.

Additional information

Funding

The work was supported in part by NIH/NCATS [grant number UL1TR000101, previously UL1RR031975] and NIH/NIMHD [grant number G12MD007597].

Notes on contributors

John Kwagyan

Dr John Kwagyan is a mathematician and a public health and medical researcher. His research interests include mathematical and statistical modelling of correlated data, clinical trials, survival models, big data analytics, statistical genetics, and pharmacokinetic and pharmacodynamic modelling. He is currently the director of Biostatistics, Epidemiology and Research Design of the Georgetown-Howard University Center for Clinical and Translational Science, Howard University College of Medicine, and adjunct professor in the Department of Mathematics, Howard University, Washington, DC, USA.

References

  • Amari, S. I. (1982). Differential geometry of curved exponential families: Curvatures and information loss. Annals of Statistics, 10, 357–385.
  • Amari, S. I. (1985). Differential geometry in statistics. New York, NY: Springer Verlag.
  • Bonney, G. E., & Kissling, G. E. (1986). Gram–Schmidt orthogonalization of multinormal variates: Applications in genetics. Biometrical Journal, 28, 417–425.
  • Bonney, G. E. (1995). Some new results on regressive models in family studies. Proceedings of the Biometrics Section, American Statistical Association, 177–182.
  • Bonney, G. E. (1998a). Regression analysis of disposition to correlated binary outcomes (Scientific Report). Philadelphia, PA: Fox Chase Cancer Center.
  • Bonney, G. E. (1998b). Regression analysis of disposition to correlated binary outcomes. Unpublished Manuscript.
  • Bonney, G. E., Kwagyan, J., & Apprey, V. (1997). MULTIMAX-A computer package for MULTI-objective MAXimization with applications in genetics and epidemiology. The American Journal of Human Genetics, 61, 447.
  • Bonney, G. E., Kwagyan, J., Slater, E., & Slifker, M. (1997). CORRDAT-A computer package for CORRelated DATa. The American Journal of Human Genetics, 61, A194.
  • Cox, D. R., & Reid, N. (1987). Parameter orthogonality and approximate conditional inference. Journal of the Royal Statistical Society: Series B, 49, 1–39.
  • Cox, D. R., & Reid, N. (1989). On the stability of maximum likelihood estimators of orthogonal parameters. Canadian Journal of Statistics, 17, 229–233.
  • Clayton, D. G. (1971). Gram–Schmidt orthogonalization. Journal of the Royal Statistical Society: Series C (Applied Statistics), 20, 335–338.
  • Godambe, V. P. (1991). Orthogonality of estimating functions and nuisance parameters. Biometrika, 78, 143–151.
  • Godambe, V. P., & Thompson, M. E. (1989). An extension of quasi-likelihood estimation (with discussion). Journal of Statistical Planning and Inference, 22, 137–172.
  • Hills, S. E. (1987). Parameter orthogonality and approximate conditional inference [Discussion]. Journal of the Royal Statistical Society: Series B, 49, 1–39.
  • Hurlimann, W. (1992). On parameter orthogonality of the mean. Statistical Papers, 33, 69–74.
  • Kalbfleisch, J. D., & Sprott, D. A. (1970). Application of likelihood methods to models involving large numbers of parameters. Journal of the Royal Statistical Society: Series B (Methodological), 32, 175–208.
  • Kwagyan, J. (2001). Further investigation of the disposition model for correlated binary outcomes (PhD thesis). Philadelphia, PA: Department of Statistics, Temple University.
  • Kwagyan, J., Apprey, V., & Bonney, G. E. (2001). Parameter orthogonalization in genetic analysis. Genetic Epidemiology, 21, 163 IGES 059.
  • Liang, K. Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.
  • Prentice, R. L. (1988). Correlated binary regression with covariates specific to each binary observation. Biometrics, 44, 1033–1048.
  • Ross, G. J. S. (1970). The efficient use of function minimization in non-linear maximum-likelihood estimation. Applied Statistics, 19, 205–221.
  • Willmot, G. E. (1988). Parameter orthogonality for a family of discrete distributions. Journal of the American Statistical Association, 83, 517–521.