Research Article

Interpolation and correlation


ABSTRACT

Historical time series sometimes have missing observations. It is common practice either to ignore these missing values or to interpolate between the adjacent observations and treat the interpolated data as if they were true data. This paper shows that interpolation changes the autocorrelation structure of the time series. Ignoring such autocorrelation in subsequent correlation or regression analysis can lead to spurious results. A simple method is presented to prevent spurious results. A detailed illustration highlights the main issues.

I. Introduction

Historical time series sometimes have missing observations, for various reasons. Consider for example the two time series in Figures 1 and 2, which give the contribution to Gross Domestic Product in Holland (in 1000 guilders) of Domestic production, trade and shipping (DPTS, for convenience) and of the Army and Navy (AN). The data are observed for 1738–1779, and as is typical for historical data, there are many missing observations. Given the time span, there could be 42 annual observations, but the number of effective observations is 24.

Figure 1. The contribution to Gross Domestic Product in Holland (in 1000 guilders) for Domestic production, trade and shipping.


Suppose one is interested in any correlation or regression relation between these two variables. One approach is to simply ignore the missing data and compute the correlation or run a regression. This simple solution could work well, but when the data show autocorrelation, that is, when the observation in year $t-1$ is informative about the observation in year $t$, the missing observations can be an obstacle for analysis. Indeed, with just three observations $y_1, y_2, y_3$ of which $y_2$ is missing, one simply cannot compute a first-order autocorrelation.

An often considered alternative is to interpolate the missing observations using the adjacent observed values. For the missing $y_2$ above, this entails replacing it by a value that is a function of $y_1$ and $y_3$. With this replacement, one does have the information needed to compute a first-order autocorrelation. Frequently, researchers choose a linear interpolation scheme, and this results in data like those in Figure 2, which displays some straight lines between various points. For another example, consider Figure 8A in O’Rourke and Jeffrey (Citation2002).

Figure 2. The contribution to Gross Domestic Product in Holland (in 1000 guilders) for the Army and Navy.


In this paper, I show that interpolation has a downside: it changes the autocorrelation structure of the time series. Ignoring such autocorrelation in subsequent correlation or regression analysis can lead to spurious results. A simple method is presented to prevent spurious results, and a detailed illustration using the two time series introduced above highlights the main issues.

II. Correlation and interpolation

Consider again the two time series in Figures 1 and 2, and suppose we are interested in measuring the relationship between these two variables. The data are observed for 1738–1779, and as is clear from the figures, there are many missing observations. Given the time span, there could have been 42 annual observations, but the number of effective observations is 24.

The estimated correlation between these two variables is −0.269. When we consider a simple regression model and apply Ordinary Least Squares (OLS) to

$$\text{DPTS}_t = \alpha + \beta\,\text{AN}_t + \varepsilon_t$$

we obtain the estimation results

$$a = 2834\ (531.9)$$
$$b = -1.494\ (0.713)$$

where the numbers in parentheses are HAC (Heteroscedasticity and Autocorrelation Consistent) standard errors. The estimated t-statistic on b is −2.097, and hence there seems to be a significant relationship, with a p value of 0.048. There does seem to be some autocorrelation in the estimated residuals, as the EViews package gives a Durbin-Watson test statistic of 0.216, which is much closer to 0 than to 2 (the value when there is no autocorrelation).
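For readers who wish to reproduce this type of regression, a minimal sketch in Python with statsmodels is given below. The results in the paper come from EViews; here the series names (`dpts`, `an`) and the HAC lag truncation are placeholders, not values taken from the paper.

```python
# Minimal sketch: OLS of DPTS on AN with HAC (Newey-West) standard errors.
# Assumes the jointly observed values are held in pandas Series `dpts` and `an`
# (placeholder names); the lag truncation maxlags=2 is an illustrative choice.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

def ols_hac(y: pd.Series, x: pd.Series, maxlags: int = 2):
    """Regress y on a constant and x, using only overlapping observations."""
    data = pd.concat({"y": y, "x": x}, axis=1).dropna()
    X = sm.add_constant(data["x"])
    return sm.OLS(data["y"], X).fit(cov_type="HAC", cov_kwds={"maxlags": maxlags})

# Example usage with the actual series (not included here):
# result = ols_hac(dpts, an)
# print(result.summary())
# print(durbin_watson(result.resid))   # Durbin-Watson statistic of the residuals
```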

We may now want to increase the effective sample size from 24 to 42 by interpolating the missing observations. An often-applied technique is to draw a straight line between the beginning and end points of a period with missing data, and then use the points on that line as the new observations. For example, suppose there are three observations $y_1, y_2, y_3$, and suppose the observation $y_2$ is missing. One can then decide to insert for $y_2$:

$$y_1 + \tfrac{1}{2}\left(y_3 - y_1\right) = \tfrac{1}{2}y_1 + \tfrac{1}{2}y_3$$

In general, if there are $m-1$ missing observations in between any $y_1$ and $y_{m+1}$, then the interpolation scheme is

$$y_1 + \tfrac{1}{m}\left(y_{m+1} - y_1\right) = \tfrac{m-1}{m}y_1 + \tfrac{1}{m}y_{m+1},$$
$$\tfrac{m-2}{m}y_1 + \tfrac{2}{m}y_{m+1},$$

to

$$\tfrac{1}{m}y_1 + \tfrac{m-1}{m}y_{m+1}$$

If this scheme is applied to the two variables, with their 42 − 24 = 18 missing observations, we obtain the data as depicted in Figures 1 and 2.
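As an aside, this straight-line scheme is exactly what standard linear interpolation routines implement. A minimal sketch in Python with pandas, using toy numbers rather than the actual GDP series, and assuming missing years are stored as NaN:

```python
# Sketch: linear interpolation of missing annual observations, matching the
# scheme above. The values below are toy numbers, not the actual data.
import numpy as np
import pandas as pd

years = range(1738, 1780)                 # 42 potential annual observations
y = pd.Series(np.nan, index=years)
y.loc[[1738, 1741, 1742]] = [100.0, 160.0, 150.0]

# Straight lines between adjacent observed points (interior gaps only):
y_interp = y.interpolate(method="linear", limit_area="inside")

# With m = 3, the missing years 1739 and 1740 become
# (2/3)*y(1738) + (1/3)*y(1741) = 120 and (1/3)*y(1738) + (2/3)*y(1741) = 140.
print(y_interp.loc[1738:1742])
```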

The correlation between these two interpolated series is −0.333, an increase relative to the −0.269 before, at least in an absolute sense. When we consider a simple regression model for these interpolated series and apply OLS to

$$\text{DPTS\_interpolated}_t = \alpha + \beta\,\text{AN\_interpolated}_t + \varepsilon_t$$

we obtain the estimation results

$$a = 2924\ (534.0)$$
$$b = -1.817\ (0.773)$$

where the numbers in parentheses are again the HAC standard errors. The estimated t-statistic on b is now −2.351, and hence there again seems to be a significant relationship, now even with a p value of 0.024. The Durbin-Watson test statistic for the 42 estimated residuals is 0.186, which is even closer to 0 and further away from 2.

The question now is whether this correlation and this relation are statistical artefacts. The small Durbin-Watson values suggest that we are missing first-order autocorrelation, and after interpolation this first-order autocorrelation seems to increase even further.

To see whether there is autocorrelation in each of the variables, we compute the first-order autocorrelation as

$$r_1 = \frac{\frac{1}{T}\sum_{t=1739}^{1779}\left(y_t - \bar{y}\right)\left(y_{t-1} - \bar{y}\right)}{\frac{1}{T}\sum_{t=1739}^{1779}\left(y_t - \bar{y}\right)^2}$$

and also the second- to fifth-order autocorrelations. These estimates are displayed in Table 1. Comparing the columns with raw data and interpolated data, it is evident that the interpolated data have much more autocorrelation. In fact, the first-order autocorrelation seems to approach 1, which corresponds to the so-called unit root case, in which one needs to resort to cointegration analysis, as is done in for example O’Rourke and Jeffrey (Citation2002).
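For completeness, a minimal sketch of this autocorrelation estimator in Python is given below; the function and the commented usage are illustrative only and assume a complete (for example, interpolated) series.

```python
# Sketch: the sample autocorrelation at lag i as defined above, with both sums
# running from observation lag+1 onward and both normalised by T.
import numpy as np

def autocorr(y, lag):
    """Sample autocorrelation of a complete series y at the given lag."""
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()
    num = np.sum(dev[lag:] * dev[:-lag]) / len(y)
    den = np.sum(dev[lag:] ** 2) / len(y)
    return num / den

# Example usage with an interpolated series held in a numpy array `y_values`:
# print([round(autocorr(y_values, i), 3) for i in range(1, 6)])   # lags 1 to 5
```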

Table 1. Autocorrelations in the various variables, computed using $r_i = \dfrac{\frac{1}{T}\sum_{t=1738+i}^{1779}\left(y_t - \bar{y}\right)\left(y_{t-i} - \bar{y}\right)}{\frac{1}{T}\sum_{t=1738+i}^{1779}\left(y_t - \bar{y}\right)^2}$.Footnote1

Before we turn to correlation and regression analysis, we first examine the potential consequences of interpolation for the time series.

III. What does interpolation do?

To understand what interpolation does to the time series properties of variables, consider the following very simple and stylized case with four observations:

$$\varepsilon_1,\ \varepsilon_2,\ \varepsilon_3,\ \varepsilon_4$$

and assume that each of these four observations is a draw from a white noise process, that is, they have mean zero and variance $\sigma^2$, and they are uncorrelated, meaning that the correlation between any $\varepsilon_i$ and $\varepsilon_j$ for $i \neq j$ is zero. It is now easy to derive that

$$r_1 = \frac{\frac{1}{3}\sum_{t=2}^{4}\varepsilon_t\varepsilon_{t-1}}{\frac{1}{3}\sum_{t=2}^{4}\varepsilon_t^2} = \frac{0}{\sigma^2} = 0$$

Next, we consider three distinct cases in which one or two observations are missing. We use the linear interpolation technique; for alternative methods, qualitatively similar results can be obtained, although the notation and mathematical expressions quickly become involved.

Case a: $\varepsilon_2$ is missing and is interpolated using $\varepsilon_1$ and $\varepsilon_3$

In this case, the data series becomes

$$\varepsilon_1,\quad \tfrac{1}{2}\varepsilon_1 + \tfrac{1}{2}\varepsilon_3,\quad \varepsilon_3,\quad \varepsilon_4$$

Call these observations $\varepsilon_t^{*}$.

For this data series, it holds that

$$\tfrac{1}{3}\sum_{t=2}^{4}\varepsilon_t^{*}\varepsilon_{t-1}^{*} = \tfrac{1}{3}\left(\tfrac{1}{2}\sigma^2 + \tfrac{1}{2}\sigma^2 + 0\right) = \tfrac{1}{3}\sigma^2$$

and that

$$\tfrac{1}{3}\sum_{t=2}^{4}\left(\varepsilon_t^{*}\right)^2 = \tfrac{1}{3}\left(\tfrac{1}{4}\sigma^2 + \tfrac{1}{4}\sigma^2 + \sigma^2 + \sigma^2\right) = \tfrac{5}{6}\sigma^2$$

Hence, the first-order autocorrelation becomes

$$r_1 = \frac{\tfrac{1}{3}\sum_{t=2}^{4}\varepsilon_t^{*}\varepsilon_{t-1}^{*}}{\tfrac{1}{3}\sum_{t=2}^{4}\left(\varepsilon_t^{*}\right)^2} = \frac{\tfrac{1}{3}}{\tfrac{5}{6}} = \frac{2}{5} = 0.4$$

So, the first-order autocorrelation quickly jumps from 0 to 0.4.

Case b: $\varepsilon_2$ and $\varepsilon_3$ are missing and are interpolated using $\varepsilon_1$ and $\varepsilon_4$

When there are two observations missing, the linear interpolation method results in the new observations

$$\varepsilon_1,\quad \tfrac{2}{3}\varepsilon_1 + \tfrac{1}{3}\varepsilon_4,\quad \tfrac{1}{3}\varepsilon_1 + \tfrac{2}{3}\varepsilon_4,\quad \varepsilon_4$$

We now have that

$$\tfrac{1}{3}\sum_{t=2}^{4}\varepsilon_t^{*}\varepsilon_{t-1}^{*} = \tfrac{1}{3}\left(\tfrac{2}{3}\sigma^2 + \tfrac{2}{9}\sigma^2 + \tfrac{2}{9}\sigma^2 + \tfrac{2}{3}\sigma^2\right) = \tfrac{16}{27}\sigma^2$$

and that

$$\tfrac{1}{3}\sum_{t=2}^{4}\left(\varepsilon_t^{*}\right)^2 = \tfrac{1}{3}\left(\tfrac{5}{9}\sigma^2 + \tfrac{5}{9}\sigma^2 + \sigma^2\right) = \tfrac{19}{27}\sigma^2$$

which makes the first-order autocorrelation become

$$r_1 = \frac{\tfrac{1}{3}\sum_{t=2}^{4}\varepsilon_t^{*}\varepsilon_{t-1}^{*}}{\tfrac{1}{3}\sum_{t=2}^{4}\left(\varepsilon_t^{*}\right)^2} = \frac{16}{19} \approx 0.84$$

The differences between cases a and b are that, as could be expected, the variance decreases (from $\tfrac{5}{6}\sigma^2$ to $\tfrac{19}{27}\sigma^2$), and that the first-order autocorrelation increases (from 0.4 to 0.84).

Case c: $\varepsilon_2$ is missing and is interpolated using $\varepsilon_1$ and $\varepsilon_3$, while there are five observations, that is, one more than in case a.

In this case, we thus have

$$\varepsilon_1,\quad \tfrac{1}{2}\varepsilon_1 + \tfrac{1}{2}\varepsilon_3,\quad \varepsilon_3,\quad \varepsilon_4,\quad \varepsilon_5$$

which gives

$$\tfrac{1}{4}\sum_{t=2}^{5}\varepsilon_t^{*}\varepsilon_{t-1}^{*} = \tfrac{1}{4}\left(\tfrac{1}{2}\sigma^2 + \tfrac{1}{2}\sigma^2 + 0 + 0\right) = \tfrac{1}{4}\sigma^2$$

and

$$\tfrac{1}{4}\sum_{t=2}^{5}\left(\varepsilon_t^{*}\right)^2 = \tfrac{1}{4}\left(\tfrac{1}{4}\sigma^2 + \tfrac{1}{4}\sigma^2 + \sigma^2 + \sigma^2 + \sigma^2\right) = \tfrac{7}{8}\sigma^2$$

resulting in

$$r_1 = \frac{\tfrac{1}{4}\sum_{t=2}^{5}\varepsilon_t^{*}\varepsilon_{t-1}^{*}}{\tfrac{1}{4}\sum_{t=2}^{5}\left(\varepsilon_t^{*}\right)^2} = \frac{2}{7} \approx 0.29$$

So, when the share of interpolated observations in the sample decreases, the induced first-order autocorrelation also decreases.
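These three stylized results can be checked numerically. The derivations above work with the expected values of the numerator and denominator of $r_1$, so the sketch below (an illustrative Monte Carlo, not part of the original analysis) averages the numerator and denominator separately across replications and then takes their ratio.

```python
# Sketch (Monte Carlo check of cases a-c): draw white-noise observations,
# replace the missing one(s) by linear interpolation, and average the numerator
# and denominator of r_1 across replications, as in the derivations above.
import numpy as np

def induced_r1(n_obs, missing, reps=100_000, seed=0):
    """Ratio of the averaged numerator and denominator of r_1 when the
    interior observations listed in `missing` are linearly interpolated."""
    rng = np.random.default_rng(seed)
    lo, hi = min(missing) - 1, max(missing) + 1   # adjacent observed points
    num = den = 0.0
    for _ in range(reps):
        e = rng.standard_normal(n_obs)            # white noise, variance 1
        for j in missing:
            w = (j - lo) / (hi - lo)
            e[j] = (1 - w) * e[lo] + w * e[hi]
        num += np.sum(e[1:] * e[:-1]) / (n_obs - 1)
        den += np.sum(e[1:] ** 2) / (n_obs - 1)
    return num / den

print(induced_r1(4, [1]))      # case a: about 0.40
print(induced_r1(4, [1, 2]))   # case b: about 16/19 = 0.84
print(induced_r1(5, [1]))      # case c: about 2/7 = 0.29
```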

IV. What does neglected autocorrelation do?

The issue of spurious correlation, which basically arises from neglected autocorrelation, was raised as early as Udny (Citation1926). When the data have trends, Granger and Newbold (Citation1974) showed that high-valued nonsense correlations can occur. Phillips (Citation1986) showed that for trended data such nonsense correlations can be obtained when one relies on inappropriate statistical methodology. Later, Granger, Hyung, and Jeon (Citation2001) derived the asymptotic distribution of the t test on the parameter $\beta$ in the simple regression

$$y_t = \alpha + \beta x_t + \varepsilon_t$$

where in reality

$$y_t = \rho y_{t-1} + u_t$$
$$x_t = \rho x_{t-1} + w_t$$

that is, the two variables are independent first-order autoregressive time series with the same parameter ρ. This asymptotic distribution is

$$t_{\beta} \sim N\!\left(0,\ \frac{1+\rho^2}{1-\rho^2}\right)$$

instead of the commonly considered $N(0,1)$ distribution. Clearly,

$$\frac{1+\rho^2}{1-\rho^2} > 1$$

and hence, significant test values will be found more often than the nominal significance level suggests. Indeed, Table 5A.1 in Franses (Citation2018) shows that when, for example, $\rho = 0.7$, one will obtain around 23% significant t test values. Moreover, in this case the average absolute correlation between $y_t$ and $x_t$ is 0.131 for 100 observations, and as large as 0.248 for 25 observations.
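A rough numerical check of this phenomenon is sketched below: two independent AR(1) series with $\rho = 0.7$ are simulated, one is regressed on the other, and the fraction of conventional OLS t tests on $\beta$ that reject at the 5% level is recorded. The sample size, number of replications, and use of plain (non-HAC) standard errors are choices made for this illustration, not taken from Franses (Citation2018).

```python
# Sketch: spurious regression between two independent AR(1) series with rho = 0.7.
# Counts how often a conventional OLS t-test on beta rejects at the 5% level.
import numpy as np
import statsmodels.api as sm

def rejection_rate(rho=0.7, n_obs=100, reps=2_000, seed=1):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        u = rng.standard_normal(n_obs)
        w = rng.standard_normal(n_obs)
        y = np.zeros(n_obs)
        x = np.zeros(n_obs)
        for t in range(1, n_obs):
            y[t] = rho * y[t - 1] + u[t]   # two independent AR(1) processes
            x[t] = rho * x[t - 1] + w[t]
        fit = sm.OLS(y, sm.add_constant(x)).fit()
        if abs(fit.tvalues[1]) > 1.96:     # nominal 5% two-sided test on beta
            rejections += 1
    return rejections / reps

print(rejection_rate())   # far above 0.05, broadly in line with the figures cited above
```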

In sum, neglected autocorrelation leads to spurious relations.

V. How to prevent spurious results?

A simple remedy to prevent spurious results is to explicitly incorporate lags of the variables. In our illustrative example, this means that we move from

$$\text{DPTS\_interpolated}_t = \alpha + \beta\,\text{AN\_interpolated}_t + \varepsilon_t$$

to

$$\text{DPTS\_interpolated}_t = \alpha + \beta\,\text{AN\_interpolated}_t + \gamma\,\text{DPTS\_interpolated}_{t-1} + \varepsilon_t$$

The OLS estimation results for this extended model are

$$a = 179.3\ (206.8)$$
$$b = 0.141\ (0.229)$$
$$c = 0.990\ (0.085)$$

Clearly, the parameter for Army and Navy is now insignificant.

Given the high-valued autocorrelations in Table 1, and also given the estimate of 0.990 for c, one can also correlate the differences of the two variables, thereby imposing that each of the two has a unit root. The regression then becomes

$$\text{DPTS\_interpolated}_t - \text{DPTS\_interpolated}_{t-1} = \alpha + \beta\left(\text{AN\_interpolated}_t - \text{AN\_interpolated}_{t-1}\right) + \varepsilon_t$$

and the OLS estimation results (with HAC standard errors) are

$$a = 97.76\ (53.28)$$
$$b = 0.082\ (0.247)$$

Clearly, there is no significant link between the two differenced variables.
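As a sketch of how both remedies can be estimated in practice, the helper below uses statsmodels' formula interface with HAC standard errors; the series names and the HAC lag length are placeholders rather than the settings behind the results reported above.

```python
# Sketch: the two remedies, given two interpolated series as pandas Series
# indexed by year (placeholder names; data as in Brandon and Bosma, 2019).
import pandas as pd
import statsmodels.formula.api as smf

def remedy_regressions(dpts_i: pd.Series, an_i: pd.Series, maxlags: int = 2):
    """Fit (1) the model with a lagged dependent variable and (2) the model in
    first differences, both with HAC standard errors."""
    df = pd.DataFrame({"dpts": dpts_i, "an": an_i})
    df["dpts_lag"] = df["dpts"].shift(1)       # lagged dependent variable
    df["d_dpts"] = df["dpts"].diff()           # first differences
    df["d_an"] = df["an"].diff()
    hac = {"maxlags": maxlags}
    m1 = smf.ols("dpts ~ an + dpts_lag", data=df).fit(cov_type="HAC", cov_kwds=hac)
    m2 = smf.ols("d_dpts ~ d_an", data=df).fit(cov_type="HAC", cov_kwds=hac)
    return m1, m2

# Usage with the actual interpolated series (not included here):
# m1, m2 = remedy_regressions(dpts_interpolated, an_interpolated)
# print(m1.summary()); print(m2.summary())
```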

VI. Conclusion

When analysing historical time series with missing observations, it is common practice either to ignore these missing values or to interpolate between the adjacent observations and treat the interpolated data as true data. In this paper, we have shown that interpolation changes the autocorrelation structure of the time series. Ignoring such autocorrelation in subsequent correlation or regression analysis could lead to spurious results. A simple method was presented to prevent spurious results. A detailed illustration highlighted the main issues and showed that a presumably non-zero correlation disappears when the data are analysed properly.

Further research should indicate how often spurious correlations appear in historical research.


Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 The data source is Brandon, P. and U. Bosma (2019), Calculating the weight of slave-based activities in the GDP of Holland and the Dutch Republic – Underlying methods, data and assumptions, The Low Countries Journal of Social and Economic History, 16 (2), 5–45, doi: 10.18352/tseg.1082

References