Spurious principal components

ABSTRACT

The principal component regression (PCR) is often used to forecast macroeconomic variables when there are many predictors. In this letter, we argue that it makes sense to pre-whiten the predictors before including them in a PCR. With simulation experiments, we show that without such pre-whitening, spurious principal components can appear and that these can become spuriously significant in a PCR. With an illustration using annual inflation rates for five African countries, we show that non-spurious principal components can be genuinely relevant in empirical forecasting models.

JEL CLASSIFICATION:

I. Introduction and motivation

The principal component regression (PCR) is frequently used to forecast macroeconomic variables when there are many predictors; see Stock and Watson (1999, 2002), Bernanke, Boivin, and Eliasz (2005), Heij, Van Dijk, and Groenen (2011) and many others. The idea of the PCR is that the predictors are summarized in a few principal components, and that these new variables enter as explanatory variables in a regression model. When summarizing the predictors, it is typical practice to use growth rates for predictors that have unit roots, but otherwise the variables are usually included as they are. In this letter, we recommend pre-whitening all predictors, that is, fitting for example autoregressive models to the data and using the residuals as the new predictors in principal components analysis (PCA). When the PCA results for raw and pre-whitened data are similar, one may well have found non-spurious principal components.

We base our recommendation on a few simulation experiments, which show that without such pre-whitening one runs the risk of finding spurious principal components, and of finding spuriously significant newly created regressors in the PCR. The reasons why one can obtain spurious effects are the same as those given in Yule (1926), Ames and Reiter (1961) and, of course, Granger and Newbold (1974).

An illustration of what a PCR can look like in case of spurious and non-spurious principal components is also given.

II. Simulation experiments

Consider the creation of four independent time series variables, each generated as a first-order autoregression. The error terms are all independent draws from a standard normal distribution. The starting values are always equal to 0. In the simulations, t runs from 1 to 50, 100 or 500.
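The data generating process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simulate_ar1` and the use of a single common autoregressive parameter `rho` for all series are our own choices and are not taken from the letter.

```python
import numpy as np


def simulate_ar1(rho, n_obs, n_series=4, rng=None):
    """Simulate independent AR(1) series x_t = rho * x_{t-1} + e_t with x_0 = 0.

    Assumption: all series share the same autoregressive parameter rho.
    """
    rng = np.random.default_rng() if rng is None else rng
    shocks = rng.standard_normal((n_obs, n_series))  # independent N(0, 1) errors
    series = np.zeros((n_obs + 1, n_series))         # row 0 holds the zero starting values
    for t in range(1, n_obs + 1):
        series[t] = rho * series[t - 1] + shocks[t - 1]
    return series[1:]                                # keep observations t = 1, ..., n_obs


data = simulate_ar1(rho=0.9, n_obs=100, rng=np.random.default_rng(0))
print(data.shape)  # (100, 4)
```

Each column of `data` is one simulated variable; setting `n_obs` to 50, 100 or 500 mirrors the sample sizes mentioned in the text.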

First, we create principal components for three of these variables, based on the correlation matrix of these three variables. This implies that the sum of the eigenvalues is equal to 3. If the three variables were each a white noise process, the estimated eigenvalues should all be approximately equal to 1. However, when the autoregressive parameter moves away from 0 and approaches 1, we may expect spurious non-zero correlations to appear across the variables, as already demonstrated in Yule (1926), and hence we may expect the first eigenvalue to deviate from 1.
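The expected behaviour of the first eigenvalue can be explored with a small Monte Carlo sketch. The code below is an assumption-laden illustration, not the simulation design of the letter: it uses 500 replications rather than 10,000, a common parameter `rho`, and hypothetical helper names.

```python
import numpy as np


def simulate_ar1(rho, n_obs, n_series, rng):
    """Independent AR(1) series with N(0, 1) errors and zero starting values."""
    shocks = rng.standard_normal((n_obs, n_series))
    series = np.zeros((n_obs + 1, n_series))
    for t in range(1, n_obs + 1):
        series[t] = rho * series[t - 1] + shocks[t - 1]
    return series[1:]


def first_eigenvalue(data):
    """Largest eigenvalue of the sample correlation matrix of the columns."""
    corr = np.corrcoef(data, rowvar=False)
    return np.linalg.eigvalsh(corr)[-1]  # eigvalsh returns eigenvalues in ascending order


rng = np.random.default_rng(0)
n_reps, n_obs = 500, 100  # far fewer replications than in the letter, for speed
average_first_eig = {}
for rho in (0.0, 0.5, 0.9):
    draws = [first_eigenvalue(simulate_ar1(rho, n_obs, 3, rng)) for _ in range(n_reps)]
    average_first_eig[rho] = float(np.mean(draws))
    print(f"rho={rho:.1f}  mean={np.mean(draws):.3f}  sd={np.std(draws):.3f}")
```

Because the three eigenvalues of a 3 x 3 correlation matrix sum to 3, the largest one can never fall below 1; the question of interest is how far above 1 it drifts as `rho` grows.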

A confirmation of these expectations is summarized in Table 1. The cells in the first panel present the average value of the first eigenvalue and its SD, across 10,000 replications. It is clear that the larger the autoregressive parameter, the larger the first eigenvalue. When the sample size increases, the deviation from 1 gets smaller, but not by much. In the second panel, we report the frequency of parameters that are significant at the 5% level, associated with the first principal component in the PCR. There, we additionally use the fourth variable, generated as a first-order autoregression like the other three, as the dependent variable in a PCR that includes the first lag of the first principal component as a regressor. Clearly, more than 5% of the parameters are significant, but the spurious effects tend to disappear as the sample size increases.

Table 1. The data generating process.
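A rough sketch of how such rejection frequencies could be computed is given below. The exact PCR specification is not reproduced here, so the regression in the code is an assumption: the dependent variable is regressed on a constant, its own first lag and the first lag of the first principal component, and significance is judged with a two-sided normal critical value of 1.96. All function names are hypothetical.

```python
import numpy as np


def simulate_ar1(rho, n_obs, n_series, rng):
    """Independent AR(1) series with N(0, 1) errors and zero starting values."""
    shocks = rng.standard_normal((n_obs, n_series))
    series = np.zeros((n_obs + 1, n_series))
    for t in range(1, n_obs + 1):
        series[t] = rho * series[t - 1] + shocks[t - 1]
    return series[1:]


def first_pc(data):
    """Scores of the first principal component, based on the correlation matrix."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    _, eigvecs = np.linalg.eigh(np.corrcoef(data, rowvar=False))
    return z @ eigvecs[:, -1]  # last column belongs to the largest eigenvalue


def t_stat_last(y, X):
    """OLS t statistic of the coefficient on the last column of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[-1] / np.sqrt(cov[-1, -1])


def rejection_frequency(rho, n_obs, n_reps, rng, crit=1.96):
    """Share of replications in which the lagged component appears significant."""
    rejections = 0
    for _ in range(n_reps):
        data = simulate_ar1(rho, n_obs, 4, rng)
        y, pc = data[:, 0], first_pc(data[:, 1:])  # y is independent of the component
        # Assumed PCR: y_t on a constant, y_{t-1} and the lagged first component.
        X = np.column_stack([np.ones(n_obs - 1), y[:-1], pc[:-1]])
        rejections += int(abs(t_stat_last(y[1:], X)) > crit)
    return rejections / n_reps


rng = np.random.default_rng(0)
freq = {rho: rejection_frequency(rho, n_obs=50, n_reps=500, rng=rng)
        for rho in (0.0, 0.5, 0.9)}
for rho, f in freq.items():
    print(f"rho={rho:.1f}  rejection frequency={f:.3f}")
```

Since the dependent variable is generated independently of the three predictors, any rejection frequency well above 0.05 points to a spurious effect. Dropping the own lag `y[:-1]` from `X` gives an alternative specification that can be compared in the same way.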

Table 2 presents similar information as Table 1, although now all variables have been pre-whitened, that is, for each variable we first estimate a first-order autoregression and then proceed with the residuals. Hence, we first run these regressions, store the three residual series, and estimate the first principal component for these residuals. From the cells in Table 2 we learn that pre-whitening makes the spurious results disappear, not only for the eigenvalues and principal components but also for the PCR.

Table 2. The data generating process.
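The pre-whitening step can be sketched as follows. This is a minimal illustration under assumptions of our own: each first-order autoregression is fitted by OLS with an intercept, the series share a common parameter `rho` of 0.9, and only 500 replications are used. The helper names are hypothetical.

```python
import numpy as np


def simulate_ar1(rho, n_obs, n_series, rng):
    """Independent AR(1) series with N(0, 1) errors and zero starting values."""
    shocks = rng.standard_normal((n_obs, n_series))
    series = np.zeros((n_obs + 1, n_series))
    for t in range(1, n_obs + 1):
        series[t] = rho * series[t - 1] + shocks[t - 1]
    return series[1:]


def prewhiten_ar1(data):
    """Residuals from an OLS-fitted AR(1) with intercept, column by column."""
    resid = np.empty((data.shape[0] - 1, data.shape[1]))  # one observation is lost to the lag
    for j in range(data.shape[1]):
        y, x = data[1:, j], data[:-1, j]
        X = np.column_stack([np.ones_like(x), x])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid[:, j] = y - X @ beta
    return resid


def first_eigenvalue(data):
    """Largest eigenvalue of the sample correlation matrix of the columns."""
    return np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[-1]


rng = np.random.default_rng(0)
raw, whitened = [], []
for _ in range(500):
    data = simulate_ar1(0.9, 100, 3, rng)
    raw.append(first_eigenvalue(data))
    whitened.append(first_eigenvalue(prewhiten_ar1(data)))

mean_raw, mean_whitened = float(np.mean(raw)), float(np.mean(whitened))
print(f"mean first eigenvalue, raw data:          {mean_raw:.3f}")
print(f"mean first eigenvalue, pre-whitened data: {mean_whitened:.3f}")
```

The same `prewhiten_ar1` function can be applied to an observed data matrix, after which the eigenvalues and weights for the raw and pre-whitened data can be compared side by side.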

III. Illustration

What is it that we recommend to practitioners so that they can recognize non-spurious principal components? We recommend comparing the eigenvalues before and after pre-whitening. In case of non-spurious results, these eigenvalues should be similar.

Consider as an illustration the three annual inflation rates for France, Japan and the USA; see Franses and Janssens (2017) for data and graphs for these and the other series used below. If we fit a first-order autoregression to each of these variables, the estimated autoregressive coefficients are 0.931, 0.776 and 0.823, respectively. These values are all fairly close to 1, and we should therefore be wary of issues similar to those observed in the simulation experiments above.

When we apply PCA on the correlation matrix, we obtain for the raw data the eigenvalues 2.425, 0.446 and 0.129, and for the residuals after fitting country-specific autoregressive models of order 1, the eigenvalues 2.359, 0.418 and 0.223. Hence, in both situations there clearly is a single dominant principal component, which explains a fraction of 0.808 and 0.786 of the variation, respectively. The weights in the first principal components are 0.610, 0.535 and 0.584 for the raw data, and 0.600, 0.553 and 0.578 for the pre-whitened data. Not only the eigenvalues but also the weights are very similar.

Consider now the five annual inflation rates for the North African countries Algeria, Egypt, Libya, Morocco and Tunisia. The first-order autocorrelations are 0.772, 0.704, 0.248, 0.654 and 0.096, respectively. The first eigenvalue obtained from PCA for the raw data is 2.348 and the first principal component covers 0.470 of the total variance. The weights are 0.379, 0.421, 0.539, 0.433 and 0.448. When we fit first-order autoregressions, and apply PCA to the residuals, we get a first eigenvalue of 1.870, which is associated with only 0.374 of the total variance. The weights have become 0.404, 0.213, 0.628, 0.212 and 0.594, which seem markedly different from those for the raw data. Hence, we may have found a spurious principal component here.
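The reported figures for both sets of countries can be checked with some simple arithmetic. With PCA on a correlation matrix, the share of variance covered by the first component equals the first eigenvalue divided by the number of variables. The cosine similarity between the raw and pre-whitened weight vectors is a summary measure of our own choosing and is not used in the letter; the numbers themselves are copied from the text.

```python
import numpy as np

# First eigenvalues and first-component weights as reported in the text.
cases = {
    "France, Japan, USA": {
        "eig_raw": 2.425, "eig_pw": 2.359,
        "w_raw": [0.610, 0.535, 0.584],
        "w_pw": [0.600, 0.553, 0.578],
    },
    "North Africa": {
        "eig_raw": 2.348, "eig_pw": 1.870,
        "w_raw": [0.379, 0.421, 0.539, 0.433, 0.448],
        "w_pw": [0.404, 0.213, 0.628, 0.212, 0.594],
    },
}


def cosine_similarity(a, b):
    """Cosine of the angle between two weight vectors (1 means identical direction)."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


results = {}
for name, c in cases.items():
    n_vars = len(c["w_raw"])
    share_raw = c["eig_raw"] / n_vars  # eigenvalues of a correlation matrix sum to n_vars
    share_pw = c["eig_pw"] / n_vars
    similarity = cosine_similarity(c["w_raw"], c["w_pw"])
    results[name] = (share_raw, share_pw, similarity)
    print(f"{name}: share raw={share_raw:.3f}, "
          f"share pre-whitened={share_pw:.3f}, weight similarity={similarity:.3f}")
```

The shares reproduce the values in the text (0.808 and 0.786 for the three countries, 0.470 and 0.374 for North Africa). The weight similarity is close to 1 for France, Japan and the USA and about 0.94 for the North African countries, in line with the observation that the weights change markedly only in the latter case.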

In Table 3, we report the estimation results for inflation in Botswana and Lesotho, two countries that are quite far away from North Africa, but for which inflation may resonate with worldwide inflation (which we assume is captured by the first principal component for France, Japan and the USA). Each first row shows that the North African principal component seems significant at close to the 5% level, while each second row shows that the world-based principal component is significant at a level well below 5%. The forecast performance of the model including the non-spurious principal component is clearly better. When we include both principal components in a single PCR, we obtain p values of 0.168 and 0.186 for the North African component in the two models, respectively. The correlation between the two principal components is only 0.335, so this loss of significance is not due to high correlation between the two variables. Hence, the non-spurious principal component makes the spurious component obsolete.

Table 3. Estimation results and evaluation of one-step-ahead forecasts, sample 1961–2015.

This illustration shows that comparing PCA outcomes for raw and pre-whitened data can be useful to diagnose non-spurious principal components.

Disclosure statement

No potential conflict of interest was reported by the authors.

References

  • Ames, E., and S. Reiter. 1961. “Distributions of Correlation Coefficients in Economic Time Series.” Journal of the American Statistical Association 56: 637–656. doi:10.1080/01621459.1961.10480650.
  • Bernanke, B. S., J. Boivin, and P. Eliasz. 2005. “Measuring the Effects of Monetary Policy: A Factor-Augmented Vector Autoregressive (FAVAR) Approach.” The Quarterly Journal of Economics 120: 387–422.
  • Franses, P. H., and E. Janssens. 2017. “Inflation in Africa, 1960–2015.” Econometric Institute Report EI-2017-26, Erasmus School of Economics. https://repub.eur.nl/pub/102219
  • Granger, C. W. J., and P. Newbold. 1974. “Spurious Regressions in Econometrics.” Journal of Econometrics 2: 111–120. doi:10.1016/0304-4076(74)90034-7.
  • Heij, C., D. Van Dijk, and P. J. F. Groenen. 2011. “Real-Time Macroeconomic Forecasting with Leading Indicators: An Empirical Comparison.” International Journal of Forecasting 27: 466–481. doi:10.1016/j.ijforecast.2010.04.008.
  • Stock, J. H., and M. W. Watson. 1999. “Forecasting Inflation.” Journal of Monetary Economics 44: 293–335. doi:10.1016/S0304-3932(99)00027-6.
  • Stock, J. H., and M. W. Watson. 2002. “Forecasting Using Principal Components from a Large Number of Predictors.” Journal of the American Statistical Association 97: 1167–1179. doi:10.1198/016214502388618960.
  • Yule, G. U. 1926. “Why Do We Sometimes Get Nonsense Correlations between Time-Series? A Study in Sampling and the Nature of Time-Series.” Journal of the Royal Statistical Society A 89: 1–69. doi:10.2307/2341482.