Abstract
A useful way of approaching a statistical problem is to consider whether the addition of some missing information would transform the problem into a standard form with a known solution. The EM algorithm (Dempster, Laird, and Rubin 1977), for example, makes use of this approach to simplify computation. Occasionally it turns out that knowledge of the missing values is not necessary to apply the standard approach. In such cases the following simple logical argument shows that any optimality properties of the standard approach in the full-information situation generalize immediately to the approach in the original limited-information situation: If any better estimate were available in the limited-information situation, it would also be available in the full-information situation, which would contradict the optimality of the original estimator. This approach then provides a simple proof of optimality, and often leads directly to a simple derivation of other properties of the solution. The approach can be taught to graduate students and theoretically inclined undergraduates. Its application to the elementary proof of a result in linear regression, and some extensions, are described in this paper. The resulting derivations provide more insight into some equivalences among models as well as proofs simpler than the standard ones.
1. Introduction
Assume the linear regression model
$$ y = X^*\beta^* + \varepsilon, \qquad (1) $$
where $y$ is an $n \times 1$ random vector of observations on the dependent variable in the $n$ experimental units, $X^*$ is a fixed, known $n \times (p+1)$ matrix with all elements of the first column equal to one and of rank $p+1$, $\beta^*$ is a fixed, unknown $(p+1)$-vector of coefficients, and $\varepsilon$ is a random $n \times 1$ vector with expected value zero and covariance matrix $\sigma^2 I$. Under model (1), the best linear unbiased estimator (BLUE) of the vector $\beta^*$ is
$$ \hat\beta^* = (X^{*\prime}X^*)^{-1}X^{*\prime}y. \qquad (2) $$
Let $\beta$ be the vector of coefficients excluding the constant additive term. A well-known fact, stated in most regression textbooks, is that the expression for estimating $\beta$ can be obtained from the inversion of a $p \times p$ rather than a $(p+1) \times (p+1)$ matrix by expressing $y$ and $X^*$ in deviation form (i.e., subtracting column means), and using an expression of the same form as (2).
Let $\bar y$ be the mean of the $n$ components of $y$, and let $\bar x$ be the $1 \times (p+1)$ row vector with elements $\bar x_i$, where $\bar x_i$ is the mean of the $n$ components of column $i$ of the matrix $X^*$. Let $\mathbf{1}$ be an $n \times 1$ vector of ones. Then $\hat\beta$, the vector of elements 2 to $(p+1)$ of $\hat\beta^*$, can be expressed as
$$ \hat\beta = (X'X)^{-1}X'(y - \bar y\mathbf{1}), \qquad (3) $$
where $X$ is the $n \times p$ matrix consisting of columns 2 to $(p+1)$ of $X^* - \mathbf{1}\bar x$ (the first column of $X^* - \mathbf{1}\bar x$ is identically zero and is dropped) and $y - \bar y\mathbf{1}$ is $y$ in deviation form. The proof of this result is not immediately obvious because the covariance matrix of $y - \bar y\mathbf{1}$ is no longer $\sigma^2 I$ but is
$$ \sigma^2\left(I - \tfrac{1}{n}J\right), \qquad (4) $$
where $J$ is an $n \times n$ matrix with all elements equal to 1. Draper and Smith (1998, pp. 27–28) and Daniel and Wood (1980, pp. 13–14) state the result without proof. Sen and Srivastava (1990, pp. 42 and 146) ask for the proof in a problem, and suggest using a generalized inverse (since the covariance matrix (4) is singular). Arnold (1981), Graybill (1976), Searle (1971), and Seber (1977) have relatively long proofs involving partitioning of matrices.
This paper presents a proof that involves only elementary facts that are usually presented in a first introduction to regression. The method of proof involves consideration of missing information that, if known, would put the equation to be solved into a standard regression form.
2. Proof of the Equivalence of Raw Form and Deviation Form Solutions
Under Model (1), noting that
$$ \bar y = \bar x\beta^* + \bar\varepsilon, \qquad (5) $$
where $\bar\varepsilon$ is the mean of the $n$ components of $\varepsilon$, and subtracting $\mathbf{1}$ times (5) from (1) yields
$$ y - \bar y\mathbf{1} = X\beta + (\varepsilon - \bar\varepsilon\mathbf{1}), \qquad (6) $$
where $X$ is the matrix defined below Equation (3). Suppose $\bar\varepsilon$ were known. Adding $\bar\varepsilon\mathbf{1}$ to both sides of (6) yields the equation
$$ y^* = X\beta + \varepsilon, \qquad (7) $$
where $y^* = y - \bar y\mathbf{1} + \bar\varepsilon\mathbf{1}$. Note that $E(y^*) = X\beta$ and $\mathrm{Cov}(y^*) = \sigma^2 I$, so (7) is in the form (1), but with $y$ and $X^*$ replaced by $y^*$ and $X$, the latter in deviation form. Thus, the BLUE of $\beta$ is
$$ \hat\beta = (X'X)^{-1}X'y^* = (X'X)^{-1}X'(y - \bar y\mathbf{1} + \bar\varepsilon\mathbf{1}) = (X'X)^{-1}X'(y - \bar y\mathbf{1}). \qquad (8) $$
The last equality follows because, although $\bar\varepsilon$, which appears in (8), is unknown, each element $i$ of the vector $X'\bar\varepsilon\mathbf{1}$ is of the form $\bar\varepsilon\sum_{j=1}^{n} x_{ji} = 0$, $i = 1, \ldots, p$, since $X$ is in deviation form. Then $(X'X)^{-1}X'(y - \bar y\mathbf{1})$ is the BLUE of $\beta$, as is $\hat\beta^*$ in (2) omitting the first element. Therefore, $\hat\beta^*$, without the intercept element, and $(X'X)^{-1}X'(y - \bar y\mathbf{1})$ must be equivalent.
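The equivalence just proved is easy to check numerically. The following numpy sketch (illustrative only; the simulated design and coefficient values are ours, not part of the derivation) fits a small model both ways, and also verifies the key step that $X'\mathbf{1} = 0$ when $X$ is in deviation form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
Xstar = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column all ones
beta_star = np.array([2.0, 1.0, -0.5, 0.3])                     # intercept plus p slopes
y = Xstar @ beta_star + rng.normal(size=n)

# Full (p+1) x (p+1) solution, as in Equation (2)
bhat_star = np.linalg.solve(Xstar.T @ Xstar, Xstar.T @ y)

# Deviation-form p x p solution, as in Equation (3)
X = Xstar[:, 1:] - Xstar[:, 1:].mean(axis=0)    # columns 2..p+1, centered
bhat = np.linalg.solve(X.T @ X, X.T @ (y - y.mean()))

assert np.allclose(X.T @ np.ones(n), 0.0)   # X'1 = 0: the step that removes the unknown mean error
assert np.allclose(bhat, bhat_star[1:])     # slopes agree with the full solution
```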
3. Extensions
Note that, instead of adding $\bar\varepsilon\mathbf{1}$ to both sides of (6) in Section 2, the realization of any random variable independent of $\varepsilon$ with mean zero and variance $\sigma^2/n$ could be added to each observation, and the proof would proceed in the same way. By combining the method of Section 2 with the addition of an external variable independent of $\varepsilon$, a further equivalence can be demonstrated.
Theorem: Consider a model of the form (1) but in which the covariance matrix of $\varepsilon$ is
$$ \sigma^2[(1-\rho)I + \rho J], \qquad -1/(n-1) \le \rho \le 1. \qquad (9) $$
The BLUE of $\beta$ in the model (1) but with the covariance matrix of $\varepsilon$ given by Equation (9) is the same as the BLUE of $\beta$ in the model (1) with the covariance matrix of $\varepsilon$ equal to $\sigma^2 I$ (note that $\beta$ is $\beta^*$ without the intercept term). Furthermore, the covariance matrix of the BLUE of $\beta$ with error covariance matrix (9) is $(1-\rho)$ times as great as the covariance matrix of the BLUE of $\beta$ with error covariance matrix $\sigma^2 I$.
Proof: Three cases will be considered separately.
i. $\rho = 0$. In this case, the model reduces to (1) with error covariance matrix $\sigma^2 I$, so no proof is needed.
ii. $\rho < 0$. As in Section 2, convert the model (1) into the form (7) by subtracting the expression for $\bar y\mathbf{1}$ from (1) and adding $\bar\varepsilon\mathbf{1}$ to both sides. The covariance matrix of the error vector $\varepsilon$ in (7) is unchanged by this transformation; i.e., it is given by (9). Now add a random vector $z\mathbf{1}$ to both sides of (7), where $z$ is the realization of a real-valued random variable independent of $\varepsilon$ and with expected value zero and variance $-\rho\sigma^2$. Equation (7) becomes
$$ y^* + z\mathbf{1} = X\beta + \varepsilon + z\mathbf{1}, \qquad (10) $$
or
$$ y^{**} = X\beta + \varepsilon^*, \qquad (11) $$
where $y^{**} = y^* + z\mathbf{1}$ and $\varepsilon^* = \varepsilon + z\mathbf{1}$. The covariance matrix of $\varepsilon^*$ is $\tau^2 I$, where $\tau^2 = (1-\rho)\sigma^2$, and therefore Model (11) is in the same form as Model (7), with a covariance matrix $(1-\rho)$ times as large, so the theorem follows from the previous result. A proof along these lines was used in Shaffer (1981, p. 609).
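The covariance algebra in step (ii) can be checked directly: adding an independent $z\mathbf{1}$ with $\mathrm{Var}(z) = -\rho\sigma^2$ contributes $-\rho\sigma^2 J$ to the error covariance, leaving a spherical matrix. A minimal sketch (the parameter values are arbitrary):

```python
import numpy as np

n, sigma2, rho = 6, 1.0, -0.1        # rho < 0; must exceed -1/(n-1) = -0.2 here
I, J = np.eye(n), np.ones((n, n))
Sigma = sigma2 * ((1 - rho) * I + rho * J)      # error covariance, Equation (9)

# Adding independent z*1 with Var(z) = -rho*sigma2 adds -rho*sigma2*J:
Sigma_star = Sigma + (-rho * sigma2) * J
assert np.allclose(Sigma_star, (1 - rho) * sigma2 * I)  # spherical, (1-rho) times as large
```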
iii. $\rho > 0$. The proof extends to the case $\rho > 0$ by first converting the model to (7) with $X$ in deviation form, as in (ii). Now express $\varepsilon$ as $\eta + z\mathbf{1}$, where $\eta$ is a random $n \times 1$ vector with mean zero and covariance matrix $(1-\rho)\sigma^2 I$, and $z$ is a real-valued random variable, independent of $\eta$, with mean zero and variance $\rho\sigma^2$. Subtract $z\mathbf{1}$ from both sides of (7), so that the resulting model is in the same form as Model (7), with a covariance matrix $(1-\rho)$ times as large, as in (ii), but now with $\varepsilon^* = \eta$. The proof proceeds as in (ii).
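The theorem itself can also be verified numerically. The sketch below (parameter values arbitrary) compares OLS with GLS under the equicorrelated covariance (9); the whole coefficient vector agrees, consistent with McElroy's (1967) result discussed below, and the slope-estimator covariance is $(1-\rho)$ times the $\sigma^2 I$ case:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2, rho = 40, 2, 1.5, 0.6
Xstar = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
I, J = np.eye(n), np.ones((n, n))
Sigma = sigma2 * ((1 - rho) * I + rho * J)          # error covariance, Equation (9)
y = Xstar @ np.array([1.0, 2.0, -1.0]) + rng.multivariate_normal(np.zeros(n), Sigma)

# OLS (BLUE under sigma^2 I) versus GLS (BLUE under Sigma)
ols = np.linalg.solve(Xstar.T @ Xstar, Xstar.T @ y)
Si = np.linalg.inv(Sigma)
gls = np.linalg.solve(Xstar.T @ Si @ Xstar, Xstar.T @ Si @ y)
assert np.allclose(ols, gls)                  # identical estimates

# Slope-estimator covariance: (1 - rho) times the sigma^2 I case
X = Xstar[:, 1:] - Xstar[:, 1:].mean(axis=0)
XtXi = np.linalg.inv(X.T @ X)
cov_iid = sigma2 * XtXi                       # error covariance sigma^2 I
cov_eq = XtXi @ (X.T @ Sigma @ X) @ XtXi      # error covariance (9)
assert np.allclose(cov_eq, (1 - rho) * cov_iid)
```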
The proof for $\rho > 0$ requires a further comment. Up to now, there has been no reference to the distribution of the errors, except for expected values and variances. As is well known, the results above are distribution-free, requiring only that $\sigma^2$ be finite. However, the proof above for $\rho > 0$, in which a random vector is expressed as the sum of two others, requires some distributional assumption, since not all random vectors can be expressed in that form.
Assume two models A and B with the same regression coefficients but different error covariance matrices. Because the BLUE, given any model, has a fixed form, depending only on the covariance matrix of the errors and not on other features of the error distribution (with the means of the errors, of course, equal to zero), it follows that if A and B (with the specified error covariance matrices above) have the same BLUEs under any error distribution, they have the same BLUEs for all error distributions. Therefore, it can be assumed, without loss of generality, that the error distributions are normal. In that case, the proof of the theorem for $\rho > 0$ can be carried out, since a normal random vector with the given covariance matrix can be decomposed as required in the proof.
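Under normality the decomposition used for $\rho > 0$ is easy to exhibit concretely. The following simulation (illustrative only; parameter values are arbitrary) draws $\eta$ and $z$ independently and confirms that $\eta + z\mathbf{1}$ has, empirically, the covariance matrix (9):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2, rho, reps = 4, 1.0, 0.5, 200_000
eta = rng.normal(scale=np.sqrt((1 - rho) * sigma2), size=(reps, n))  # eta ~ N(0, (1-rho)sigma^2 I)
z = rng.normal(scale=np.sqrt(rho * sigma2), size=(reps, 1))          # z ~ N(0, rho*sigma^2)
eps = eta + z                                 # broadcasting adds the same z to every coordinate

Sigma = sigma2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))     # Equation (9)
assert np.allclose(np.cov(eps, rowvar=False), Sigma, atol=0.03)      # empirical covariance matches
```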
Note that another proof, in the same spirit, that combines negative and positive values of $\rho$ can be obtained by transforming the model into the deviation form (6), and then adding the realization of a single independent random variable with variance $(1-\rho)\sigma^2/n$ to each observation. The proof using this alternative approach appears to be more compact but involves somewhat more matrix manipulation than the previous one. Both proofs start with the deviation form (6). In order to determine the variance of the single independent variable value to be added to each observation using this alternative proof, it is necessary to derive the covariance matrix of $\varepsilon - \bar\varepsilon\mathbf{1}$, which requires a fair amount of calculation because of the nonzero covariances in (9). In the original proof given, by starting with (6) and adding $\bar\varepsilon\mathbf{1}$ to both sides, it is immediately clear that the equation is in the form (7) with the covariance matrix unchanged from the original form (9). The single further step, required in both proofs, involves adding (or subtracting) the realization of a random variable, independent of $\varepsilon$, to each observation. The only knowledge needed for this latter step is that the variance of that random variable is added to every term in the covariance matrix; this is easy to show. However, the alternative proof, although somewhat more complex than the first one given, is nonetheless simpler than proofs presently available in the literature.
McElroy (1967) proved that the error covariance matrix (9) is a necessary and sufficient condition for (2) to be the BLUE of $\beta^*$ in Model (1), for $-1/(n-1) < \rho < 1$. McElroy’s result is more general than the result in this theorem in that it includes the estimate of the intercept and includes necessity as well as sufficiency of (9). On the other hand, the result in this theorem is more general than McElroy’s result in that the singular matrices resulting when $\rho = -1/(n-1)$ and when $\rho = 1$ are included in the range of $\rho$.
An interesting insight follows from considering the case $\rho = 1$. In this case, there is (with probability one) a single realization $z$ of a random variable added to the expected value for each observation. Then, in the form (6), each element of $y - \bar y\mathbf{1}$ is exactly equal to its expected value, so its covariance matrix is the null matrix, and $\beta$ can be calculated exactly.
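This degenerate case can be illustrated numerically as well: with $\rho = 1$ the entire error vector is one shared draw, and the deviation-form solution recovers the slopes exactly. A sketch with made-up values:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 2
Xstar = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_star = np.array([1.0, 3.0, -2.0])

z = rng.normal()              # rho = 1: a single error value shared by all n observations
y = Xstar @ beta_star + z     # broadcasting adds the same z to every component

X = Xstar[:, 1:] - Xstar[:, 1:].mean(axis=0)
bhat = np.linalg.solve(X.T @ X, X.T @ (y - y.mean()))
assert np.allclose(bhat, beta_star[1:])   # slopes recovered exactly, up to rounding
```

Centering removes both the intercept and the shared error $z$, so the deviation-form system is noise-free.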
4. Conclusion
The development in Section 2 provides a simple proof that estimators of the coefficient vector $\beta$ in a standard linear model, excluding the intercept term, can be obtained by expressing the sample values of the predictors and predicted variable in deviation form and using the standard equation for the estimator of $\beta$ in a model without an intercept. From a geometric point of view, note that the regression plane goes through the point of means $(\bar x_2, \ldots, \bar x_{p+1}, \bar y)$, and centering shifts the plane to go through the origin but doesn’t change the slopes. (Of course the solution in this paper is for a model with an intercept, and should not be confused with the solution for a model that assumes an intercept of zero, for which the estimators and their properties are different.) The proof proceeds by noting that the addition of an unknown value $\bar\varepsilon$ would put the relevant expressions into standard form. The same approach, with the addition of an external random vector, leads to a simple proof that the BLUE of $\beta$ when the error covariance matrix is of the form $\sigma^2[(1-\rho)I + \rho J]$ is the same as when it is of the standard form $\sigma^2 I$, and yields a simple expression for the comparative variances of the estimators. It follows also from the proof that the results hold even if the covariance matrix of $\varepsilon$ is singular. Neither extensive matrix manipulation nor knowledge of generalized inverses is necessary. The results generalize under appropriate conditions to the correlation model (Arnold 1981; Shaffer 1991), in which the elements of $X$ are realizations of random variables.
Acknowledgments
We would like to thank Katherine Halvorsen for her valuable comments which helped to improve the presentation of this paper.
References
- Arnold, S. F. (1981), The Theory of Linear Models and Multivariate Analysis, New York: Wiley-Interscience.
- Daniel, C., and Wood, F. S. (1980), Fitting Equations to Data: Computer Analysis of Multifactor Data for Scientists and Engineers (2nd ed.), New York: Wiley-Interscience.
- Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood From Incomplete Data Via the EM Algorithm,” Journal of the Royal Statistical Society, Ser. B, 39, 1–22.
- Draper, N. R., and Smith, H. (1998), Applied Regression Analysis (3rd ed.), New York: Wiley.
- Graybill, F. A. (1976), Theory and Application of the Linear Model, North Scituate, MA: Duxbury Press.
- McElroy, F. W. (1967), “A Necessary and Sufficient Condition That Ordinary Least Squares Estimators Be Best Linear Unbiased,” Journal of the American Statistical Association, 62, 1302–1304.
- Searle, S. R. (1971), Linear Models, New York: Wiley.
- Seber, G. A. F. (1977), Linear Regression Analysis, New York: Wiley.
- Sen, A., and Srivastava, M. (1990), Regression Analysis: Theory, Methods, and Applications, New York: Springer-Verlag.
- Shaffer, J. P. (1981), “The Analysis of Variance Mixed Model With Allocated Observations: Application to Repeated Measurement Designs,” Journal of the American Statistical Association, 76, 607–611.
- Shaffer, J. P. (1991), “The Gauss-Markov Theorem and Random Regressors,” The American Statistician, 45, 269–273.