Original Articles

A simple way to deal with multicollinearity

Pages 1893-1909 | Received 24 Sep 2011, Accepted 01 May 2012, Published online: 29 May 2012
 

Abstract

Despite the long and frustrating history of struggling with wrong signs or other types of implausible estimates under multicollinearity, it turns out that the problem can be solved in a surprisingly easy way. This paper presents a simple approach that ensures both statistically sound and theoretically consistent estimates under multicollinearity. The approach is simple in the sense that it requires nothing but basic statistical methods plus a piece of a priori knowledge. In addition, the approach is robust even in the extreme case where the a priori knowledge is wrong. A simulation test shows astonishingly superior performance of the method in repeated samples compared with the OLS, the Ridge Regression and the Dropping-Variable approach.
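
As a minimal illustration of the problem the abstract describes (a hypothetical simulation, not the paper's code; the design, seed and shrinkage parameter are arbitrary choices), two nearly identical regressors can leave OLS unstable on the individual coefficients even though their sum is well identified, while ridge regression stabilizes the pair by shrinking:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)  # true coefficients: (1, 1)

X = np.column_stack([x1, x2])

# OLS: individual coefficients are poorly identified under collinearity.
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge: adding lam * I to X'X shrinks and stabilizes the estimates.
lam = 1.0
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print("OLS:  ", b_ols)
print("Ridge:", b_ridge)
```

Here the individual OLS coefficients can be far from (1, 1) or even wrong-signed, yet their sum stays close to the true total of 2, because the data pin down only the combined effect of the two regressors.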

Acknowledgements

I am indebted to David Kemme, George Renko, Ardell J. Miller and Cyrus Pardis for stimulating discussions that have substantially helped the progress of the study. I thank Cyrus J. Pardis for his kindness and help in improving the writing of the paper. I am especially thankful to Peter Kennedy, who offered a great deal of time and invaluable comments and suggestions on the paper. The comments and suggestions from three anonymous referees and the editor of Journal of Applied Statistics have greatly lifted the quality of the paper and are much appreciated. All errors are mine.

Notes

Dedicated to Mr Muhua Chen, my beloved father in Heaven.

The RC does not require a precise and accurate piece of a priori information about the true coefficients. Section 6 illustrates how simple and vague information can deliver astonishingly good estimates.

The idea of this theorem benefited from Kennedy [9], who provides an insightful interpretation of the confidence region and a helpful clue that the region can be expressed in terms of R².

In fact the RC is not only open to all “a priori” information, but open to all “ex post” information – information obtained after the regression results, including the ellipsoid, are known to the researcher. This gives greater flexibility to the researcher for improving his estimates. For example, the researcher may have a clearer and more realistic idea about what the reasonable coefficients should be like after s/he sees all the regression results. In this sense, the approach is open to all forms of information.

It has been assumed that the data used are accurate. However, if this assumption is in doubt, then a re-investigation of the data would be recommended.

It is not necessarily the case that only the last coefficient can be precisely estimated. Suppose, for instance, we have 10 collinear regressors, but the last five, although highly correlated with the first five, are not much correlated with one another within their own group. Then, once estimates of the first five are pinned down, the last five can all be precisely estimated. So exactly how many other coefficients can be precisely estimated depends on the degree of the “residual” correlation among the remaining regressors.
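
The grouping described above can be sketched numerically (a hypothetical simulation, not from the paper; the sizes and noise scales are arbitrary): each of the last five regressors tracks one of the first five, while within each block the regressors are nearly uncorrelated, so the full design is ill-conditioned even though the last block alone is not:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
Z = rng.normal(size=(n, 5))              # first five regressors, nearly uncorrelated
W = Z + 0.05 * rng.normal(size=(n, 5))   # last five: each highly correlated with its partner
X = np.hstack([Z, W])

corr = np.corrcoef(X, rowvar=False)
print(round(corr[0, 5], 3))              # cross-group pair: close to 1
print(round(corr[5, 6], 3))              # within the last group: close to 0

# The full design is nearly singular, but the last block alone is well
# conditioned, so once the first five coefficients are pinned down the
# last five are all precisely estimable.
print(np.linalg.cond(X.T @ X), np.linalg.cond(W.T @ W))
```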

For instance, the loss in R² within the 95% confidence region is not necessarily larger than 0.01 for a sample of fewer than 650 observations. The actual loss also depends on other parameters and on exactly where within the confidence region one finds the plausible coefficients.

The OLS produces the best-fit coefficients rather than the true coefficients.

See discussion in question numbers 2 and 3.

For example, it is not easy to tell which coefficient value is reasonable for one individual regressor out of a group if the reasonableness of the group of regressors can only be determined as a whole, e.g. a polynomial in one variable. The Staged Regression approach is more convenient in handling such cases.

It is arguable whether a coefficient estimate under multicollinearity can be interpreted as the ceteris paribus effect of the regressor on the dependent variable, because a change in one regressor is necessarily associated with a change in another. The Staged Regression approach allows a more realistic interpretation of its estimators as the total and the net effects, respectively, taking the correlation among regressors into consideration.

The GORC is but one among many possible methods for determining exactly where inside the ellipse one conjectures the true coefficients. For example, one can drop the “guaranteed outperforming” criterion and simply choose the largest possible value inside the ellipse (which would be the upper bound of β1), or simply choose the midpoint (the upper bound of β1 divided by 2). The choice really depends on exactly what a priori information and subjective preferences the researcher has. Further, while the GORC solution b1,rc embeds the “outperforming” element, it does not necessarily “outperform” all other potential solutions, even those that do not specifically embed the “outperforming” element. This is because the GORC guarantees outperforming the OLS in each and every possible sample, but this does not necessarily imply the best average performance over repeated samples.
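
The upper bound of β1 inside the joint confidence ellipsoid, and the “upper bound divided by 2” midpoint, can be computed directly. A minimal sketch under standard OLS assumptions (hypothetical data, not the paper's code) uses the 95% ellipsoid {β : (β − b)′X′X(β − b) ≤ c} with c = p·s²·F(p, n − p; 0.95), over which the maximum of β1 is b1 + √(c·[(X′X)⁻¹]₁₁):

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(2)
n, p = 100, 2
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)            # collinear pair
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]      # OLS estimate
resid = y - X @ b
s2 = resid @ resid / (n - p)                  # error variance estimate
c = p * s2 * f_dist.ppf(0.95, p, n - p)       # ellipsoid bound: (beta-b)'X'X(beta-b) <= c

XtX_inv = np.linalg.inv(X.T @ X)
upper_b1 = b[0] + np.sqrt(c * XtX_inv[0, 0])  # largest beta_1 inside the ellipsoid
midpoint = upper_b1 / 2.0                     # the "upper bound divided by 2" choice
print(upper_b1, midpoint)
```

The same formula with a general direction vector a gives max a′β = a′b + √(c·a′(X′X)⁻¹a), so any linear combination of the coefficients can be bounded over the ellipsoid in the same way.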

The PCA and PLS methods are not included because they are irrelevant in the “wrong sign” context and do not allow an apples-to-apples comparison with the RC. The “wrong sign” context requires that the original, real-world variables be used, so that their coefficients are interpretable and it is possible to tell whether a “wrong sign” has occurred. Because PCA and PLS mix the original variables into components, they no longer share the same variables with the RC and thus do not fit into the comparison framework. Besides, the coefficient estimates of PCA and PLS are difficult to interpret, let alone to judge which sign is right or wrong for them.

Note that the RDG, DRP and RC estimates and R² are nevertheless calculated even in the “right sign” cases as if they were “wrong sign” cases. The purpose is to calculate the R² loss in case the a priori knowledge is incorrect, since incorrect a priori information takes the right sign as wrong.

The 50,000 − 14,995 = 35,005 cases in which the OLS estimate has the right sign can be used as the cases in which the GORC is applied with wrong prior information, because the wrong information takes the right signs as wrong.
