
A Novel Method of Proof With an Application to Regression


Abstract

A useful way of approaching a statistical problem is to consider whether the addition of some missing information would transform the problem into a standard form with a known solution. The EM algorithm (Dempster, Laird, and Rubin 1977), for example, makes use of this approach to simplify computation. Occasionally it turns out that knowledge of the missing values is not necessary to apply the standard approach. In such cases the following simple logical argument shows that any optimality properties of the standard approach in the full-information situation generalize immediately to the approach in the original limited-information situation: If any better estimate were available in the limited-information situation, it would also be available in the full-information situation, which would contradict the optimality of the original estimator. This approach then provides a simple proof of optimality, and often leads directly to a simple derivation of other properties of the solution. The approach can be taught to graduate students and theoretically-inclined undergraduates. Its application to the elementary proof of a result in linear regression, and some extensions, are described in this paper. The resulting derivations provide more insight into some equivalences among models as well as proofs simpler than the standard ones.

1. Introduction

Assume the linear regression model

$$ y_+ = X_+\beta_+ + \varepsilon, \qquad (1) $$

where $y_+$ is an $n \times 1$ random vector of observations on the dependent variable in the $n$ experimental units, $X_+$ is a fixed, known $n \times (p+1)$ matrix of rank $p+1$, with all elements of the first column equal to one, $\beta_+ = (\beta_0, \ldots, \beta_p)'$ is a fixed, unknown $(p+1)$-vector of coefficients, and $\varepsilon$ is a random $n \times 1$ vector with expected value zero and covariance matrix $\sigma^2 I$. Under model (1), the best linear unbiased estimator (BLUE) of the vector $\beta_+$ is

$$ \hat\beta_+ = (X_+'X_+)^{-1}X_+'y_+. \qquad (2) $$
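
As a concrete illustration (not part of the original paper), the following NumPy sketch simulates data from model (1) and computes the BLUE (2) by solving the normal equations. The sample size, the "true" coefficients, and all variable names are arbitrary choices made for the example.

```python
# Minimal illustrative sketch: simulate model (1) and compute the BLUE (2).
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X_plus = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column all ones
beta_plus = np.array([2.0, 1.0, -0.5, 0.3])                      # illustrative coefficients
eps = rng.normal(scale=1.0, size=n)                               # mean 0, covariance sigma^2 I
y_plus = X_plus @ beta_plus + eps

# Equation (2): beta_hat_plus = (X_+' X_+)^{-1} X_+' y_+
beta_hat_plus = np.linalg.solve(X_plus.T @ X_plus, X_plus.T @ y_plus)
print(beta_hat_plus)
```

Solving the normal equations with `np.linalg.solve` rather than forming the inverse explicitly is the standard numerically stable choice; the algebra is the same as in (2).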

Let $\beta = (\beta_1, \ldots, \beta_p)'$ be the vector of coefficients excluding the constant additive term. A well-known fact, stated in most regression textbooks, is that the expression for estimating $\beta$ can be obtained from the inversion of a $p \times p$ rather than a $(p+1) \times (p+1)$ matrix by expressing $X_+$ and $y_+$ in deviation form (i.e., subtracting column means) and using an expression of the same form as (2).

Let $\bar y_+$ be the mean of the $n$ components of $y_+$, and let $\bar X_+$ be the $1 \times (p+1)$ row vector with elements $\bar x_{+,(0)}, \bar x_{+,(1)}, \ldots, \bar x_{+,(p)}$, where $\bar x_{+,(i)}$ is the mean of the $n$ components of column $i$ of the matrix $X_+$. Let $1$ be an $n \times 1$ vector of ones. Then $\hat\beta$, with elements 2 to $(p+1)$ of $\hat\beta_+$, can be expressed as

$$ \hat\beta = (X'X)^{-1}X'y, \qquad (3) $$

where $X$ is the $n \times p$ matrix consisting of columns 2 to $(p+1)$ of $(X_+ - 1\bar X_+)$ (the first column of $(X_+ - 1\bar X_+)$ is identically zero and is dropped) and $y = (y_+ - 1\bar y_+)$. The proof of this result is not immediately obvious because the covariance matrix of $y$, given $X$, is no longer $\sigma^2 I$ but is

$$ \Sigma_y = \sigma^2[I - (1/n)J], \qquad (4) $$

where $J$ is an $n \times n$ matrix with all elements equal to 1. Draper and Smith (1998, pp. 27-28) and Daniel and Wood (1980, pp. 13-14) state the result without proof. Sen and Srivastava (1990, pp. 42 and 146) ask for the proof in a problem, and suggest using a generalized inverse (since the covariance matrix (4) is singular). Arnold (1981), Graybill (1976), Searle (1971), and Seber (1977) have relatively long proofs involving partitioning of matrices.
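
Continuing the illustrative sketch above, the deviation-form result (3) can be checked numerically: centering the predictors and the response and solving the smaller $p \times p$ system reproduces the slope estimates from (2). The names `X_plus`, `y_plus`, and `beta_hat_plus` carry over from the earlier block.

```python
# Numerical check of the deviation-form result (3), continuing the sketch above.
X = X_plus[:, 1:] - X_plus[:, 1:].mean(axis=0)   # columns 2..(p+1) of (X_+ - 1 Xbar_+)
y = y_plus - y_plus.mean()                        # y_+ - 1 ybar_+

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # Equation (3): only a p x p system
print(np.allclose(beta_hat, beta_hat_plus[1:]))   # True: slopes agree with (2)
```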

This paper presents a proof that involves only elementary facts that are usually presented in a first introduction to regression. The method of proof involves consideration of missing information that, if known, would put the equation to be solved into a standard regression form.

2. Proof of the Equivalence of Raw Form and Deviation Form Solutions

Under Model (1), noting that

$$ 1\bar y_+ = 1(\bar X_+\beta_+ + \bar\varepsilon), \qquad (5) $$

where $\bar\varepsilon$ is the mean of the $n$ components of $\varepsilon$, and subtracting (5) from (1) yields

$$ y = X\beta + \varepsilon - 1\bar\varepsilon, \qquad (6) $$

where $X$ is the $n \times p$ matrix defined below Equation (3). Suppose $\bar\varepsilon$ were known. Adding $1\bar\varepsilon$ to both sides of (6) yields the equation

$$ y^* = X\beta + \varepsilon, \qquad (7) $$

where $y^* = y + 1\bar\varepsilon$. Note that $E(y^*) = X\beta$ and $\Sigma_{y^*} = \sigma^2 I$, so (7) is in the form (1), but with $y_+$ and $X_+$ replaced by $y^*$ and $X$, the latter in deviation form. Thus, the BLUE of $\beta$ is

$$ \hat\beta = (X'X)^{-1}X'y^* = (X'X)^{-1}X'(y + 1\bar\varepsilon) = (X'X)^{-1}X'y. \qquad (8) $$

The last equality follows because, although $\bar\varepsilon$, which appears in (8), is unknown, each element $j$ of the vector $X'(1\bar\varepsilon)$ is of the form $\bar\varepsilon\left(\sum_{i=1}^{n} x_{ij}\right) = 0$, $j = 1, \ldots, p$, since $X$ is in deviation form. Then $\hat\beta$ is the BLUE of $\beta$, as is $\hat\beta_+$ omitting the first element. Therefore, $\hat\beta_+$, without the intercept element, and $\hat\beta$ must be equivalent.
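
The step that makes (8) work, namely that $X'(1\bar\varepsilon) = 0$ because every column of $X$ sums to zero, can be confirmed in the same illustrative sketch (names `X`, `eps`, and `n` as above):

```python
# Key step in (8): the columns of the centered X sum to zero, so X'(1 eps_bar) = 0
# and the unknown eps_bar drops out of the estimator.
eps_bar = eps.mean()
print(np.allclose(X.sum(axis=0), 0.0))                   # column sums of centered X are zero
print(np.allclose(X.T @ (np.ones(n) * eps_bar), 0.0))    # hence X'(1 eps_bar) = 0
```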

3. Extensions

Note that, instead of adding $1\bar\varepsilon$ to both sides of (6) in Section 2, the realization of any random variable independent of $\varepsilon$ with mean zero and variance $\sigma^2/n$ could be added to each observation, and the proof would proceed in the same way. By combining the method of Section 2 with the addition of an external variable independent of $\varepsilon$, a further equivalence can be demonstrated.

Theorem: Consider a model of the form (1) but in which the covariance matrix of $\varepsilon$ is

$$ \Sigma_\varepsilon = \sigma^2[(1-\rho)I + \rho J], \qquad -1/(n-1) \le \rho \le 1. \qquad (9) $$

The BLUE of $\beta$ in the model (1) but with the covariance matrix of $\varepsilon$ given by Equation (9) is the same as the BLUE of $\beta$ in the model (1) with the covariance matrix of $\varepsilon$ equal to $\sigma^2 I$ (note that $\beta$ is $\beta_+$ without the intercept term). Furthermore, the covariance matrix of the BLUE of $\beta$ with error covariance matrix (9) is $(1-\rho)$ times as great as the covariance matrix of the BLUE of $\beta$ with error covariance matrix $\sigma^2 I$.

Proof: Three cases will be considered separately.

i. $\rho = 0$. In this case, the model reduces to (1) with error covariance matrix $\sigma^2 I$, so no proof is needed.

ii. $\rho < 0$. As in Section 2, convert the model (1) into the form (7) by subtracting the expression for $1\bar y_+$ from (1) and adding $1\bar\varepsilon$ to both sides. The covariance matrix of $y^*$ is unchanged from that of $y_+$ by this transformation; i.e., it is given by (9). Now add a random vector $1\delta$ to both sides of (7), where $\delta$ is the realization of a real-valued random variable independent of $\varepsilon$ and with expected value zero and variance $\sigma^2|\rho|$. Equation (7) becomes

$$ y^* + 1\delta = X\beta + \varepsilon + 1\delta, \qquad (10) $$

or

$$ z = X\beta + \eta, \qquad (11) $$

where $z = y^* + 1\delta$ and $\eta = \varepsilon + 1\delta$. The covariance matrix of $\eta$ is $\sigma^2(1-\rho)I = \tau^2 I$, where $\tau^2 = (1-\rho)\sigma^2$, and therefore Model (11) is in the same form as Model (7), with a covariance matrix $(1-\rho)$ times as large, so the theorem follows from the previous result. A proof along these lines was used in Shaffer (1981, p. 609).

iii. $\rho > 0$. The proof extends to the case $\rho > 0$ by first converting the model to (7) with $X$ in deviation form, as in (ii). Now express $\varepsilon$ as $\eta + 1\delta$, where $\eta$ is a random $n \times 1$ vector with mean zero and covariance matrix $\sigma^2(1-\rho)I$, and $\delta$ is a real-valued random variable, independent of $\eta$, with mean zero and variance $\sigma^2\rho$. Subtract $1\delta$ from both sides of (7), so that the resulting model is in the same form as Model (7), with a covariance matrix $(1-\rho)$ times as large, as in (ii), but now with $z = y^* - 1\delta$. The proof proceeds as in (ii).

The proof for $\rho > 0$ requires a further comment. Up to now, there has been no reference to the distribution of the errors, except for expected values and variances. As is well known, the results above are distribution-free, requiring only that $\sigma^2$ be finite. However, the proof above for $\rho > 0$, in which a random vector is expressed as the sum of two others, requires some distributional assumption, since not all random vectors can be expressed in that form.

Assume two models A and B with the same regression coefficients but different error covariance matrices. Because the BLUE, given any model, has a fixed form, depending only on the covariance matrix of the errors and not on other features of the error distribution (with error means equal to zero, of course), it follows that if A and B (with the specified error covariance matrices above) have the same BLUEs under one error distribution, they have the same BLUEs under all error distributions. Therefore, it can be assumed, without loss of generality, that the error distributions are normal. In that case, the proof of the theorem for $\rho > 0$ can be carried out, since a normal random vector with the given covariance matrix can be decomposed as required in the proof.
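
The theorem can also be checked numerically. The sketch below is illustrative only: it reuses the simulated data from the earlier blocks (`X_plus`, `y_plus`, `beta_hat_plus`, `X`, `n`) with an arbitrary value of $\rho$, constructs the covariance matrix (9), computes the generalized least squares estimator, and confirms that its slope components agree with the ordinary least squares slopes and that their covariance matrix is $(1-\rho)$ times the $\sigma^2 I$ covariance.

```python
# Numerical check of the theorem under equicorrelated errors (illustrative values).
rho, sigma2 = 0.4, 1.0
Sigma = sigma2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))   # Equation (9)

# GLS estimator under Sigma; its slopes should match the OLS slopes.
Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X_plus.T @ Si @ X_plus, X_plus.T @ Si @ y_plus)
print(np.allclose(beta_gls[1:], beta_hat_plus[1:]))                # True: slopes agree

# Covariance of the slope estimator: (1 - rho) times the sigma^2 I covariance.
XtX_inv = np.linalg.inv(X.T @ X)
cov_identity = sigma2 * XtX_inv                                     # under sigma^2 I
cov_under_9 = XtX_inv @ X.T @ Sigma @ X @ XtX_inv                   # under (9)
print(np.allclose(cov_under_9, (1 - rho) * cov_identity))           # True
```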

Note that another proof, in the same spirit, that combines negative and positive values of $\rho$ can be obtained by transforming the model into the deviation form (6), and then adding the realization of a single independent random variable with variance $\sigma^2(1-\rho)/n$ to each observation. The proof using this alternative approach appears to be more compact but involves somewhat more matrix manipulation than the previous one. Both proofs start with the deviation form (6). In order to determine the variance of the single independent variable value to be added to each observation using this alternative proof, it is necessary to derive the covariance matrix of $(\varepsilon - 1\bar\varepsilon)$, which requires a fair amount of calculation because of the nonzero covariances in (9). In the original proof given, by starting with (6) and adding $1\bar\varepsilon$ to both sides, it is immediately clear that the equation is in the form (7) with the covariance matrix unchanged from the original form (9). The single further step, required in both proofs, involves adding (or subtracting) the realization of a random variable, independent of $\varepsilon$, to each observation. The only knowledge needed for this latter step is that the variance of that random variable is added to every term in the covariance matrix; this is easy to show. However, the alternative proof, although somewhat more complex than the first one given, is nonetheless simpler than proofs presently available in the literature.
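
The covariance facts needed for this alternative proof can likewise be verified numerically in the same illustrative setting (names `Sigma`, `rho`, `sigma2`, `n` as above): the covariance matrix of $(\varepsilon - 1\bar\varepsilon)$ under (9) is $\sigma^2(1-\rho)[I - (1/n)J]$, and adding an independent variable with variance $\sigma^2(1-\rho)/n$ to each observation restores a covariance matrix proportional to the identity.

```python
# Covariance facts used in the alternative proof (illustrative check).
C = np.eye(n) - np.ones((n, n)) / n              # centering matrix: eps - 1 eps_bar = C eps
cov_dev = C @ Sigma @ C                          # covariance of (eps - 1 eps_bar) under (9)
print(np.allclose(cov_dev, sigma2 * (1 - rho) * C))           # sigma^2 (1 - rho)[I - (1/n)J]

# Adding one independent variable with variance sigma^2 (1 - rho)/n to every
# observation adds that variance to every entry of the covariance matrix:
v = sigma2 * (1 - rho) / n
print(np.allclose(cov_dev + v * np.ones((n, n)),
                  sigma2 * (1 - rho) * np.eye(n)))            # back to a multiple of I
```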

McElroy (1967) proved that the error covariance matrix (9) is a necessary and sufficient condition for (2) to be the BLUE of $\beta_+$ in Model (1), for $-1/(n-1) < \rho < 1$. McElroy's result is more general than the result in this theorem in that it includes the estimate of $\beta_0$ and includes necessity as well as sufficiency of (9). On the other hand, the result in this theorem is more general than McElroy's result in that the singular matrices resulting when $\rho = -1/(n-1)$ and when $\rho = 1$ are included in the range of $\rho$.

An interesting insight follows from considering the case $\rho = 1$. In this case, with probability one, a single common realization of a random variable is added to the expected value of each observation. Then, in the form (6), each element of $y$ is exactly equal to its expected value, so the covariance matrix of $y$ is the null matrix, and $\beta$ can be calculated exactly.
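
A small numerical illustration of the $\rho = 1$ case, again reusing the simulated design above (`rng`, `X_plus`, `beta_plus`, `X`, `n`) with an arbitrary single error value:

```python
# rho = 1: the error vector is one random value repeated, so centering removes it
# entirely and the slopes are recovered exactly (up to floating-point error).
delta = rng.normal()
y1 = X_plus @ beta_plus + delta * np.ones(n)     # model (1) with eps = 1*delta
y1_dev = y1 - y1.mean()                          # deviation form (6): equals X beta exactly
beta_exact = np.linalg.solve(X.T @ X, X.T @ y1_dev)
print(np.allclose(beta_exact, beta_plus[1:]))    # True: beta recovered exactly
```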

4. Conclusion

The development in Section 2 provides a simple proof that estimators of the coefficient vector in a standard linear model, excluding the intercept term, can be obtained by expressing the sample values of the predictors and predicted variable in deviation form and using the standard equation for the estimator of $\beta$ in a model without an intercept. From a geometric point of view, note that the regression plane goes through the point $(\bar x_{+,(1)}, \ldots, \bar x_{+,(p)}, \bar y_+)$, and centering shifts the plane to go through the origin but does not change the slopes. (Of course, the solution in this paper is for a model with an intercept, and should not be confused with the solution for a model that assumes an intercept of zero, for which the estimators and their properties are different.) The proof proceeds by noting that the addition of an unknown value ($\bar\varepsilon$) would put the relevant expressions into standard form. The same approach, with the addition of an external random vector, leads to a simple proof that the BLUE of $\beta$ when the error covariance matrix is of the form $\sigma^2[(1-\rho)I + \rho J]$ is the same as when it is of the standard form $\sigma^2 I$, and yields a simple expression for the comparative variances of the estimators. It follows also from the proof that the results hold even if the covariance matrix of $y_+$ is singular. Neither extensive matrix manipulation nor knowledge of generalized inverses is necessary. The results generalize under appropriate conditions to the correlation model (Arnold 1981; Shaffer 1991), in which the elements of $X$ are realizations of random variables.

Acknowledgments

We would like to thank Katherine Halvorsen for her valuable comments which helped to improve the presentation of this paper.

References

  • Arnold, S. F. (1981), The Theory of Linear Models and Multivariate Analysis, New York: Wiley-Interscience.
  • Daniel, C., and Wood, F. S. (1980), Fitting Equations to Data: Computer Analysis of Multifactor Data for Scientists and Engineers (2nd ed.), New York: Wiley-Interscience.
  • Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood From Incomplete Data Via the EM Algorithm,” Journal of the Royal Statistical Society, Ser. B, 39, 1–22.
  • Draper, N. R., and Smith, H. (1998), Applied Regression Analysis (3rd ed.), New York: Wiley.
  • Graybill, F. A. (1976), Theory and Application of the Linear Model, North Scituate, MA: Duxbury Press.
  • McElroy, F. W. (1967), “A Necessary and Sufficient Condition That Ordinary Least Squares Estimators Be Best Linear Unbiased,” Journal of the American Statistical Association, 62, 1302–1304.
  • Searle, S. R. (1971), Linear Models, New York: Wiley.
  • Seber, G. A. F. (1977), Linear Regression Analysis, New York: Wiley.
  • Sen, A., and Srivastava, M. (1990), Regression Analysis: Theory, Methods, and Applications, New York: Springer-Verlag.
  • Shaffer, J. P. (1981), “The Analysis of Variance Mixed Model With Allocated Observations: Application to Repeated Measurement Designs,” Journal of the American Statistical Association, 76, 607–611.
  • Shaffer, J. P. (1991), “The Gauss-Markov Theorem and Random Regressors,” The American Statistician, 45, 269–273.
