ABSTRACT
General prediction formulas involving Hermite polynomials are developed for time series expressed as a transformation of a Gaussian process. The prediction gains over linear predictors are examined numerically, demonstrating the improvement of nonlinear prediction.
1. Introduction
The general prediction problem is to compute the best predictor of a random variable $Y$ given a data vector $\mathbf{X}$, where a joint distribution is presumed to exist for $Y$ and $\mathbf{X}$. If we define ‘best’ according to mean squared error (MSE) loss,Footnote1 the best predictor (when the random variables are square integrable) is the conditional expectation $E[Y \mid \mathbf{X}]$, which in the case of Gaussian variables is a linear function of $\mathbf{X}$. This linear function is completely computable in terms of first and second moments of the joint vector $(Y, \mathbf{X}')'$, as discussed in Brockwell and Davis (Citation2013, Chapter 2). The problem can also be generalised to projection on infinite data sets, which arise in forecasting and signal extraction problems.
The theory for linear predictors is very well understood and is commonly applied to non-Gaussian data because it is simple to compute. Nevertheless, there can be a substantial predictive loss when non-Gaussian features are present in the data, such as asymmetry and excess kurtosis (Brockett et al., Citation1988; Maravall, Citation1983). A common technique for handling such raw time series data is to apply a transformation that reduces asymmetry and kurtosis, thereby generating cumulants in the transformed space that more closely resemble Gaussian cumulants. Box-Cox transforms are an example of such functions, and are typically identified through exploratory analysis or via metadata; see discussion in McElroy (Citation2016).
This paper provides exact formulas for non-linear prediction in scenarios where the non-Gaussian data process can be expressed as a univariate transformation of some Gaussian process. For some special cases, such as the log-normal distribution, exact formulas are already available for nonlinear predictors; here the general case is developed. The main result of the paper (Section 2) is an expansion of the conditional expectation in terms of Hermite coefficients of the transformation function – an idea that was utilised in Janicki and McElroy (Citation2016) to model marginal quantiles. Here this technique is used to derive analytical expressions for predictors, along with the MSE; the solution is given as an explicit function of the Hermite coefficients of the transformation function, the various Hermite polynomials evaluated at the linear predictor, and further weights explicitly determined from the mean squared error of the linear predictor.
These results are general, in the sense that they can be applied to diverse contexts in statistics, such as linear models, spatial statistics, and multivariate analysis. But our applications are focussed on time series, and in particular on forecasting problems. In forecasting, the data vector $\mathbf{X} = (Y_1, \ldots, Y_T)'$ is a sample of size $T$ from a time series $\{Y_t\}$, and $Y = Y_{T+1}$ represents the next unobserved value of the process. Backcasting involves setting $Y = Y_0$, and missing value problems can similarly be addressed by letting $Y = Y_t$ for some $1 \le t \le T$ and $\mathbf{X} = (Y_1, \ldots, Y_{t-1}, Y_{t+1}, \ldots, Y_T)'$; see McElroy and McCracken (Citation2017) for a recent treatment. These facets of the general methodology are developed in Section 2, and numerical comparisons are given in Section 3. The proofs are in an Appendix.
2. Nonlinear prediction
Our goal is to compute the minimal mean squared error (MSE) estimate of $Y$ given data $Y_1, \ldots, Y_T$ (finite-sample case) or $\{Y_s : s \le T\}$ (semi-infinite sample case). This estimate is the conditional expectation $E[Y \mid \text{data}]$, denoted $\widehat{Y}$ for short. We presume that there exists an invertible function $f$ such that $X_t = f(Y_t)$ yields a Gaussian process $\{X_t\}$. Moreover, $X = f(Y)$ is a Gaussian random variable whose joint distribution with the process $\{X_t\}$ is known. In the case of forecasting, backcasting, or missing value imputation, $Y = Y_t$ for some $t$.
In the following development, it is important to impose that $X$ be standard normal, although this would rarely be the case if $f$ is obtained by exploratory analysis. (For example, if $Y$ is positive and $f = \log$ by exploratory analysis, it would rarely be the case that $\log Y$ would have unit variance.) Let $Z$ denote the standardisation of $X$, i.e., $Z = (X - \mu)/\sigma$, where $\mu$ and $\sigma^2$ are the mean and variance of $X$. We will define $g$ as the inverse map from $Z$ to $Y$, so that
$Y = g(Z) = f^{-1}(\mu + \sigma Z).$
The mapping $g$ allows us to obtain a Hermite expansion; if the marginal distribution of $Z$ were non-Gaussian, we could instead have recourse to the Appell polynomials (Varma, Citation1951).
The main result of the paper is an expression for $E[Y \mid \text{data}]$ in terms of $\widehat{Z} = E[Z \mid \text{data}]$, with the data expressed on the transformed scale via $Z_t = (X_t - \mu)/\sigma$. This is of interest because such a Gaussian conditional expectation has a well-known linear formula; see Chapter 4 of McElroy and Politis (Citation2020). In applications, one might transform the data by applying $f$, then model the Gaussian process, and finally compute $\widehat{Z}$ by plugging into the linear formulas. Then the main formula of this paper can be used to obtain $\widehat{Y}$ by inserting $\widehat{Z}$ and its MSE, denoted by $V$, which would also be available in applications. In particular,
(1) $V = E[(Z - \widehat{Z})^2],$
and $\widehat{Z}$ is given by formulas in McElroy and Politis (Citation2020); also, see (11) below. We define the space $L^2(\Phi)$ (where $\Phi$ is the standard normal cumulative distribution function) as all functions that are square integrable with respect to the measure $\Phi$. An inner product on this space is defined via $\langle g, h \rangle = \int g(x)\, h(x)\, \phi(x)\, dx$, where $\phi$ is the standard normal probability density function. Hence, we can also write $\langle g, h \rangle = E[g(Z)\, h(Z)]$ for $Z \sim \mathcal{N}(0,1)$. It follows that we can do a Hermite expansion on $g$ (see Janicki & McElroy, Citation2016 for background):
(2) $g(x) = \sum_{k \ge 0} g_k\, H_k(x),$
with $H_k$ the normalised Hermite polynomials (Roman, Citation1984), and $g_k = \langle g, H_k \rangle$ the Hermite coefficients. The Hermite polynomials are defined via $H_k(x) = (-1)^k\, (k!)^{-1/2}\, \phi(x)^{-1}\, \frac{d^k}{dx^k}\phi(x)$ for $k \ge 0$, and form a complete orthonormal system. (Hence, the coefficients $g_k$ tend to zero as $k \to \infty$.) Plugging $Z$ into (2) yields $Y = g(Z) = \sum_{k \ge 0} g_k H_k(Z)$, and applying the conditional expectation operator (which is linear) yields
(3) $\widehat{Y} = \sum_{k \ge 0} g_k\, E[H_k(Z) \mid \text{data}].$
Evidently, the nonlinear predictor can be computed in terms of conditional expectations of $H_k(Z)$, and the formula is summarised in our main theorem below. The proof relies upon the Hermite generating function
(4) $\sum_{k \ge 0} H_k(x)\, \frac{t^k}{\sqrt{k!}} = \sum_{k \ge 0} \mathrm{He}_k(x)\, \frac{t^k}{k!} = \exp\{tx - t^2/2\},$
where $\mathrm{He}_k = \sqrt{k!}\, H_k$ denotes the unnormalised Hermite polynomial. (The second equality follows from the definition of the Hermite polynomials, and is discussed in Roman (Citation1984).) From (4) we see that
(5) $\mathrm{He}_k(x + y) = \sum_{\ell=0}^{k} \binom{k}{\ell}\, \mathrm{He}_\ell(x)\, y^{k-\ell},$
because $\exp\{t(x+y) - t^2/2\} = \exp\{tx - t^2/2\}\, \exp\{ty\}$. Therefore, writing $Z = \widehat{Z} + \delta$ with $\delta$ independent of the data, $E[\mathrm{He}_k(Z) \mid \text{data}] = \sum_{\ell=0}^{k} \binom{k}{\ell}\, \mathrm{He}_\ell(\widehat{Z})\, E[\delta^{k-\ell}]$. We next discuss the optimal predictor.
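The expansion (5) and the conditional moment identity above can be checked numerically. Below is a minimal Python sketch using only the standard library; the helper names (`hermite_e`, `gaussian_moment`) and the particular values of $k$, $V$, $\widehat{Z}$ are our own illustrative choices, not from the paper.

```python
import math
import random

def hermite_e(k, x):
    """Unnormalised (probabilists') Hermite polynomial He_k(x),
    via the recursion He_{k+1}(x) = x He_k(x) - k He_{k-1}(x)."""
    h_prev, h_curr = 1.0, x
    if k == 0:
        return h_prev
    for j in range(1, k):
        h_prev, h_curr = h_curr, x * h_curr - j * h_prev
    return h_curr

def gaussian_moment(ell, V):
    """ell-th moment of a N(0, V) variable: zero for odd ell,
    V^{ell/2} (ell - 1)!! for even ell (and 1 for ell = 0)."""
    if ell % 2 == 1:
        return 0.0
    return V ** (ell // 2) * math.prod(range(ell - 1, 0, -2))

random.seed(0)
k, V, zhat = 3, 0.4, 0.5

# right-hand side: sum_l C(k,l) He_l(zhat) E[delta^{k-l}]
formula = sum(math.comb(k, l) * hermite_e(l, zhat) * gaussian_moment(k - l, V)
              for l in range(k + 1))

# left-hand side: Monte Carlo average of He_k(zhat + delta), delta ~ N(0, V)
n = 200_000
mc = sum(hermite_e(k, zhat + random.gauss(0.0, math.sqrt(V)))
         for _ in range(n)) / n
```

For $k = 3$ the binomial formula collapses to $\mathrm{He}_3(\widehat{Z}) + 3V\widehat{Z}$, which the Monte Carlo average reproduces to sampling error.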
Theorem 2.1
Suppose that $Z$ is standard normal, $Y = g(Z)$ with $g$ given by (2), and $\widehat{Z}$ is the linear prediction in the transformed space, with MSE given by $V$ in (1). Then the MSE optimal nonlinear predictor $\widehat{Y}$ of $Y$ given the data is
(6) $\widehat{Y} = \sum_{k \ge 0} g_k\, (k!)^{-1/2} \sum_{\ell=0}^{k} \binom{k}{\ell}\, \mathrm{He}_\ell(\widehat{Z})\, m_{k-\ell}(V),$
where $m_\ell(V)$ is the $\ell$th moment of a Gaussian variable of variance $V$, i.e., $m_\ell(V)$ equals 1 if $\ell = 0$, equals zero if $\ell$ is odd, and equals $V^{\ell/2}(\ell-1)!!$ if $\ell$ is even. With $W = 1 - V$, the optimal prediction MSE is
(7) $E[(Y - \widehat{Y})^2] = \sum_{k \ge 1} g_k^2\, (1 - W^k).$
The optimal predictor (6) can be explicitly computed to any desired level of accuracy, by truncating the first summation over $k$ at a desired level. Note that this result applies to finite samples, semi-infinite samples, and bi-infinite samples, allowing us to apply (6) to forecasting (from an infinite past) and missing value problems.
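As a sketch of how (6) and (7) might be coded (standard-library Python; function names are ours): in the affine case $g(z) = \mu + \sigma z$ of Example 2.1 below, the routine reduces exactly to the linear predictor $\mu + \sigma\widehat{Z}$ with MSE $\sigma^2 V$, which provides a convenient check.

```python
import math

def hermite_e(k, x):
    """Unnormalised Hermite polynomial He_k(x) by recursion."""
    h_prev, h_curr = 1.0, x
    if k == 0:
        return h_prev
    for j in range(1, k):
        h_prev, h_curr = h_curr, x * h_curr - j * h_prev
    return h_curr

def gaussian_moment(ell, V):
    """m_ell(V): ell-th moment of N(0, V)."""
    if ell % 2 == 1:
        return 0.0
    return V ** (ell // 2) * math.prod(range(ell - 1, 0, -2))

def predictor(coeffs, zhat, V):
    """Truncated version of (6); coeffs[k] is the Hermite coefficient g_k."""
    total = 0.0
    for k, gk in enumerate(coeffs):
        inner = sum(math.comb(k, l) * hermite_e(l, zhat) * gaussian_moment(k - l, V)
                    for l in range(k + 1))
        total += gk / math.sqrt(math.factorial(k)) * inner
    return total

def prediction_mse(coeffs, V):
    """Truncated version of (7), with W = 1 - V."""
    W = 1.0 - V
    return sum(gk ** 2 * (1.0 - W ** k) for k, gk in enumerate(coeffs) if k >= 1)

# affine sanity check: g(z) = mu + sigma z should give the linear predictor back
mu, sigma, V, zhat = 2.0, 1.5, 0.3, 0.7
yhat = predictor([mu, sigma], zhat, V)
mse = prediction_mse([mu, sigma], V)
```

The truncation level is simply the length of `coeffs`; for non-polynomial transforms one would pass a longer coefficient list.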
Remark 2.1
Since $\langle g, g \rangle = \sum_{k \ge 0} g_k^2$, we see that the Hermite coefficients are square summable if $g \in L^2(\Phi)$, and the MSE is well-defined, being bounded above by $\sum_{k \ge 1} g_k^2 = \mathrm{Var}(Y)$. (Note that $0 \le V \le 1$ because $Z$ has variance one.) Because $g \in L^2(\Phi)$, we see that $g(Z)$ is square integrable, and hence the formula (6) converges.
Remark 2.2
In applications, we typically will know $f$ rather than $g$, but the latter can be obtained from the former by applying the normalising transform described above, i.e., $g(z) = f^{-1}(\mu + \sigma z)$.
Remark 2.3
To apply Theorem 2.1 we must obtain the Hermite coefficients $g_k$. Either $g$ is known (obtained by exploratory analysis) or estimated (see Janicki & McElroy, Citation2016), and then the Hermite coefficients can be obtained via Monte Carlo simulation:
$g_k \approx n^{-1} \sum_{i=1}^{n} g(Z_i)\, H_k(Z_i)$
for $Z_1, \ldots, Z_n$ i.i.d. standard normal. Another approach to computation involves the generating function:
$g_k = \langle g, H_k \rangle = (k!)^{-1/2}\, E[g^{(k)}(Z)].$
The last expression denotes the $k$-fold derivative of the function $g$. In cases where $g$ is explicitly known, this can be an easier approach to getting the Hermite coefficients.
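The two computations in Remark 2.3 can be compared directly. A standard-library Python sketch (helper names are ours), using $g = \exp$, for which the derivative formula gives $g_k = e^{1/2}/\sqrt{k!}$ exactly:

```python
import math
import random

def hermite_e(k, x):
    """Unnormalised Hermite polynomial He_k(x) by recursion."""
    h_prev, h_curr = 1.0, x
    if k == 0:
        return h_prev
    for j in range(1, k):
        h_prev, h_curr = h_curr, x * h_curr - j * h_prev
    return h_curr

def coeff_monte_carlo(g, k, draws):
    """g_k = E[g(Z) H_k(Z)] with H_k = He_k / sqrt(k!), estimated by averaging."""
    s = sum(g(z) * hermite_e(k, z) for z in draws)
    return s / (len(draws) * math.sqrt(math.factorial(k)))

random.seed(1)
draws = [random.gauss(0.0, 1.0) for _ in range(300_000)]

mc = [coeff_monte_carlo(math.exp, k, draws) for k in range(3)]
exact = [math.exp(0.5) / math.sqrt(math.factorial(k)) for k in range(3)]
```

The Monte Carlo estimates converge at the usual $n^{-1/2}$ rate, with variance that grows with $k$, so the derivative formula is preferable when $g$ has tractable derivatives.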
In the following examples we suppose that $\widehat{Z}$ and $V$ are known and available to the practitioner; we present various cases of transforms $f$, and apply the results of Theorem 2.1.
Example 2.1
Gaussian
A simple affine transformation ensures that $Y$ is still Gaussian, with mean $\mu$ and variance $\sigma^2$. In this case $g(z) = \mu + \sigma z$, so $g_0 = \mu$, $g_1 = \sigma$, and $g_k = 0$ for $k \ge 2$, and we more simply have $\widehat{Y} = \mu + \sigma\widehat{Z}$.
Example 2.2
Lognormal
Suppose $Y = \exp\{Z\}$, i.e., $g = \exp$; applying the generating function method of Remark 2.3, we obtain $g_k = e^{1/2}\,(k!)^{-1/2}$ for all $k \ge 0$. This can be utilised in (3), together with (5), to yield
$\widehat{Y} = \exp\{\widehat{Z} + V/2\},$
which corresponds to the result of McElroy (Citation2010). Applying Theorem 2.1 to compute the optimal MSE, we see that
$E[(Y - \widehat{Y})^2] = \sum_{k \ge 1} \frac{e}{k!}\,(1 - W^k) = e^2 - e^{2 - V}.$
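The lognormal predictor $\exp\{\widehat{Z} + V/2\}$ and MSE $e^2 - e^{2-V}$ admit closed forms, which provide a check on the truncated series (6) and (7). A Python sketch (standard library; helper names and the values of $\widehat{Z}$, $V$, and the truncation level are our own choices):

```python
import math

def hermite_e(k, x):
    """Unnormalised Hermite polynomial He_k(x) by recursion."""
    h_prev, h_curr = 1.0, x
    if k == 0:
        return h_prev
    for j in range(1, k):
        h_prev, h_curr = h_curr, x * h_curr - j * h_prev
    return h_curr

def gaussian_moment(ell, V):
    """m_ell(V): ell-th moment of N(0, V)."""
    if ell % 2 == 1:
        return 0.0
    return V ** (ell // 2) * math.prod(range(ell - 1, 0, -2))

zhat, V, K = 0.3, 0.25, 30            # K is the truncation level
g = [math.exp(0.5) / math.sqrt(math.factorial(k)) for k in range(K)]

# truncated (6)
yhat = sum(g[k] / math.sqrt(math.factorial(k))
           * sum(math.comb(k, l) * hermite_e(l, zhat) * gaussian_moment(k - l, V)
                 for l in range(k + 1))
           for k in range(K))

# truncated (7), with W = 1 - V
W = 1.0 - V
mse = sum(g[k] ** 2 * (1.0 - W ** k) for k in range(1, K))
```

Both truncated sums agree with the closed forms to high accuracy, since the terms decay factorially in $k$.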
Example 2.3
Uniform
For $\{Z_t\}$ with standard normal marginal, set $Y_t = \Phi(Z_t)$, so that $Y_t$ has a marginal distribution that is uniform on $[0,1]$. Then by Remark 2.3, we find that $g_k = (k!)^{-1/2}\, E[\Phi^{(k)}(Z)] = (k!)^{-1/2}\, E[\phi^{(k-1)}(Z)]$ for $k \ge 1$, and hence
$\widehat{Y} = E[\Phi(\widehat{Z} + \delta) \mid \text{data}] = \Phi\big(\widehat{Z}/\sqrt{1 + V}\big),$
where $\delta = Z - \widehat{Z} \sim \mathcal{N}(0, V)$ is independent of the data.
Example 2.4
Logistic
Consider a logistic transform given by $g(z) = (1 + e^{-z})^{-1}$. The first few derivatives of $g$ are
$g' = g(1 - g), \quad g'' = g(1 - g)(1 - 2g), \quad g''' = g(1 - g)(1 - 6g + 6g^2).$
By Monte Carlo, we obtain the leading Hermite coefficients; the symmetry $g(-z) = 1 - g(z)$ forces $g_0 = 1/2$ and $g_2 = 0$, while $g_1$ and $g_3$ must be computed numerically.
Example 2.5
Square
With $Y = Z^2$, we have $g_0 = 1$, $g_1 = 0$, $g_2 = \sqrt{2}$, and $g_k = 0$ for $k > 2$. Then the optimal nonlinear predictor is $\widehat{Y} = \widehat{Z}^2 + V$. It is simple to check that the error is
$Y - \widehat{Y} = 2\widehat{Z}\delta + (\delta^2 - V),$
which has mean zero, since $\delta$ has mean zero and is independent of all functions of the data. The prediction MSE is $2(1 - W^2) = 4V - 2V^2$.
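The error decomposition above can be verified by simulation. The following standard-library Python sketch uses our own illustrative setup, $\widehat{Z} \sim \mathcal{N}(0, 1-V)$ and $\delta \sim \mathcal{N}(0, V)$, and checks that the error has mean zero and second moment $4V - 2V^2$.

```python
import math
import random

random.seed(2)
V = 0.3
n = 400_000

mean_err = 0.0
mean_sq = 0.0
for _ in range(n):
    zhat = random.gauss(0.0, math.sqrt(1.0 - V))   # linear predictor, variance 1 - V
    delta = random.gauss(0.0, math.sqrt(V))        # prediction error, variance V
    err = (zhat + delta) ** 2 - (zhat ** 2 + V)    # Y - Yhat for the square transform
    mean_err += err / n
    mean_sq += err ** 2 / n
```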
3. Comparing linear and nonlinear prediction
It is of interest to understand how much benefit nonlinear prediction provides. Clearly, if $g$ is affine (Example 2.1) then the minimal MSE is equal to $\sigma^2 V$, the same as for linear prediction, but we can expect gains to the degree that $g$ differs from the affine case.
Remark 3.1
A related nonlinear predictor that is sometimes used in applications is defined via $\widetilde{Y} = g(\widehat{Z})$, but unfortunately this estimator can be biased. Following the same arguments used in the proof of Theorem 2.1,
$Y - \widetilde{Y} = \sum_{k \ge 0} g_k\, \big( H_k(Z) - H_k(\widehat{Z}) \big),$
so that the expectation of the quantity in parentheses is
$E[H_k(Z)] - E[H_k(\widehat{Z})],$
which is nonzero for even $k \ge 2$, because $\widehat{Z}$ has variance $1 - V$ rather than unit variance. Hence there is no guarantee that the bias is zero.
More properly, a comparison can be made to the best linear predictor. When $\{Z_t\}$ is a strictly stationary time series, then $\{Y_t\}$ is as well, and we can in some cases determine the best linear estimator's MSE for comparison. Note that in this special case, the mean and variance of each variable $Y_t$ is the same, and hence without any loss of generality we may assume that $\{Z_t\}$ is standardised, i.e., each $Z_t$ is standard normal. Because $Y_t = g(Z_t)$, it follows from Taniguchi and Kakizawa (Citation2000, p. 319) that $E[Y_t] = g_0$ and
(8) $\gamma_Y(h) = \sum_{k \ge 1} g_k^2\, \gamma_Z(h)^k,$
where $\gamma_Y$ and $\gamma_Z$ are the autocovariance sequences of $\{Y_t\}$ and $\{Z_t\}$, respectively. So in principle we can understand the second order structure of $\{Y_t\}$ in terms of the Hermite coefficients and the original autocovariances.
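For the square transform of Example 2.5, relation (8) reduces to $\gamma_Y(h) = 2\gamma_Z(h)^2$, which is easy to confirm by simulating correlated Gaussian pairs (standard-library Python sketch; the correlation value is our choice):

```python
import math
import random

random.seed(3)
rho = 0.6            # assumed value of gamma_Z(h) for some lag h
n = 500_000

acc = 0.0
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    y = rho * x + math.sqrt(1.0 - rho ** 2) * random.gauss(0.0, 1.0)
    acc += (x ** 2 - 1.0) * (y ** 2 - 1.0)   # centred, since E[Z^2] = 1

cov = acc / n        # Monte Carlo estimate of Cov(Z_s^2, Z_t^2)
```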
Suppose further that we are interested in one-step ahead forecasting from a sample of size $T$. The best linear predictor is obtained by solving the Yule-Walker equations in $\gamma_Y$, and the MSE of such is given by
(9) $\gamma_Y(0) - \underline{\gamma}_Y'\, \Gamma_Y^{-1}\, \underline{\gamma}_Y,$
where $\Gamma_Y$ is the $T$-dimensional Toeplitz covariance matrix with $jk$th entry $\gamma_Y(j - k)$, and $\underline{\gamma}_Y = (\gamma_Y(1), \ldots, \gamma_Y(T))'$. We know that such an MSE must be greater than the minimal MSE provided in Theorem 2.1, with equality occurring only in the case that a linear estimator is globally optimal (e.g., the time series is linear, or is Gaussian). If instead we are forecasting from an infinite past, then the MSE of the linear predictor is the innovation variance $\sigma_Y^2$ given by Kolmogorov's formula:
(10) $\sigma_Y^2 = \exp\Big\{ (2\pi)^{-1} \int_{-\pi}^{\pi} \log f_Y(\lambda)\, d\lambda \Big\}.$
Here, $f_Y$ is the spectral density of $\{Y_t\}$. Therefore, for either a finite sample or for an infinite past, we can determine the linear predictor MSE for a stationary process by first computing $\gamma_Y$ from $\gamma_Z$ via (8), followed by application of (9) or (10) as each case requires. As for the best (nonlinear) predictor, its MSE is given by (7) of Theorem 2.1, where
(11) $V = \gamma_Z(0) - \underline{\gamma}_Z'\, \Gamma_Z^{-1}\, \underline{\gamma}_Z$
is the analogue of (9) for the $\{Z_t\}$ process.
We provide an illustration in the case of an MA(1) process with various values of $\theta$, and sample size $T = 100$. The innovation variance is set equal to $(1 + \theta^2)^{-1}$ so that $\gamma_Z(0) = 1$, as required by the above discussion. For the MA(1) process with $T = 100$, the value of $V$ given by (11) is the same (up to the fourth decimal place) as the innovation variance $(1 + \theta^2)^{-1}$. For transformations, we study the square $g(z) = z^2$, the exponential $g(z) = e^z$, and the logistic. Observe that from (8) the process $\{Y_t\}$ will be $m$-dependent if $\{Z_t\}$ is (although the converse need not be true). Hence, if we obtained a sample from $\{Y_t\}$ it would likely be identified with an MA($q$) model, and the parameter estimates (e.g., obtained using a Whittle likelihood, which is valid for non-Gaussian processes so long as the cumulants are summable) would likely converge to those corresponding to the spectral factorisation of $f_Y$. Thus, our illustration provides an accurate rendition of the prediction MSE one would obtain in the case of linear or nonlinear predictors, only with the impact of parameter estimation error completely removed (Tables 1–3).
Table 1. MSE for linear and non-linear predictors applied to a squared MA(1) process of parameter θ.
Table 2. MSE for linear and non-linear predictors applied to an exponential MA(1) process of parameter θ.
Table 3. MSE for linear and non-linear predictors applied to a logistic MA(1) process of parameter θ.
In each case, the degree of benefit to nonlinear prediction increases with $\theta$, as is to be expected; however, there are large discrepancies between the three functions. Even at the largest values of $\theta$, the logistic transformation offers only a marginal improvement with nonlinear prediction, indicating that linear prediction is almost just as good as the conditional expectation. The analogous improvements for the exponential and square transformations (Tables 1–3) are considerably larger, indicating some real benefit to nonlinear prediction.
We end with a remark on how a confidence interval can be constructed using simulations. Let the formula given by (6) be denoted $\widehat{Y} = h(\widehat{Z})$. Then the optimal prediction error can be written
$Y - \widehat{Y} = g(\widehat{Z} + \delta) - h(\widehat{Z}),$
where $\delta$ is the (linear) prediction error for the Gaussian variable, and is uncorrelated with (and hence independent of) $\widehat{Z}$. Also $\delta \sim \mathcal{N}(0, V)$. Therefore, in cases where it is easy to simulate $\delta$ (e.g., suppose we are forecasting from a Gaussian ARMA process) we can independently draw $\delta$ and compute $\varepsilon = g(\widehat{Z} + \delta) - h(\widehat{Z})$ for repeated Monte Carlo draws, thereby obtaining a confidence interval for $Y$.
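A minimal simulation of this interval construction, for the square transform of Example 2.5 (standard-library Python; the values of $\widehat{Z}$ and $V$ are our own illustrative choices):

```python
import math
import random

random.seed(4)
V, zhat = 0.3, 0.8

def g(z):                 # square transform (Example 2.5)
    return z * z

def h(z):                 # optimal predictor for the square case, from (6)
    return z * z + V

# simulate the prediction error eps = g(zhat + delta) - h(zhat), delta ~ N(0, V)
n = 100_000
eps = sorted(g(zhat + random.gauss(0.0, math.sqrt(V))) - h(zhat) for _ in range(n))
lo, hi = eps[int(0.025 * n)], eps[int(0.975 * n)]
interval = (h(zhat) + lo, h(zhat) + hi)

# check coverage on fresh draws of Y = g(zhat + delta)
m = 20_000
cover = sum(interval[0] <= g(zhat + random.gauss(0.0, math.sqrt(V))) <= interval[1]
            for _ in range(m)) / m
```

By construction the interval has approximately 95% coverage, which the fresh draws confirm.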
Disclaimer
This report is released to inform interested parties of research and to encourage discussion. The views expressed on statistical issues are those of the authors and not those of the U.S. Census Bureau.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Additional information
Notes on contributors
Tucker McElroy
Tucker McElroy is Senior Time Series Mathematical Statistician at the U.S. Census Bureau.
Srinjoy Das
Srinjoy Das is a Postdoctoral Scholar in the Mathematics department at the University of California, San Diego.
Notes
1 Other loss functions could of course be envisioned (mean absolute loss yields the conditional median, for example).
References
- Brockett, P. L., Hinich, M. J., & Patterson, D. (1988). Bispectral-based tests for the detection of Gaussianity and linearity in time series. Journal of the American Statistical Association, 83(403), 657–664. https://doi.org/10.1080/01621459.1988.10478645
- Brockwell, P. J., & Davis, R. A. (2013). Time series: Theory and methods. Springer Science & Business Media.
- Janicki, R., & McElroy, T. (2016). Hermite expansion and estimation of monotonic transformations of Gaussian data. Journal of Nonparametric Statistics, 28(1), 207–234. https://doi.org/10.1080/10485252.2016.1139880
- Maravall, A. (1983). An application of nonlinear time series forecasting. Journal of Business & Economic Statistics, 1(1), 66–74. https://doi.org/10.1080/07350015.1983.10509325
- McElroy, T. (2010). A nonlinear algorithm for seasonal adjustment in multiplicative component decompositions. Studies in Nonlinear Dynamics and Econometrics, 14(4). Article 6. https://doi.org/10.2202/1558-3708.1756
- McElroy, T. (2016). On the measurement and treatment of extremes in time series. Extremes, 19(3), 467–490. https://doi.org/10.1007/s10687-016-0254-4
- McElroy, T., & McCracken, M. (2017). Multi-step ahead forecasting of vector time series. Econometric Reviews, 36(5), 495–513. https://doi.org/10.1080/07474938.2014.977088
- McElroy, T., & Politis, D. (2020). Time series: A first course with bootstrap starter. Chapman Hall.
- Roman, S. (1984). The umbral calculus. Academic Press.
- Taniguchi, M., & Kakizawa, Y. (2000). Asymptotic theory of statistical inference for time series. Springer.
- Varma, R. S. (1951). On Appell polynomials. Proceedings of the American Mathematical Society, 2(4), 593–596. https://doi.org/10.1090/S0002-9939-1951-0042547-5
Appendix
Proof of Theorem 2.1
From (3) and (5) we obtain
$\widehat{Y} = \sum_{k \ge 0} g_k\, (k!)^{-1/2}\, E[\mathrm{He}_k(Z) \mid \text{data}].$
We can write $Z = \widehat{Z} + \delta$, and using the property that $\delta$ is independent of all functions of the data, we obtain
$E[\mathrm{He}_k(Z) \mid \text{data}] = \sum_{\ell=0}^{k} \binom{k}{\ell}\, \mathrm{He}_\ell(\widehat{Z})\, E[\delta^{k-\ell}] = \sum_{\ell=0}^{k} \binom{k}{\ell}\, \mathrm{He}_\ell(\widehat{Z})\, m_{k-\ell}(V).$
Hence (6) follows. The prediction error is
$Y - \widehat{Y} = \sum_{k \ge 0} g_k\, \big( H_k(Z) - E[H_k(Z) \mid \text{data}] \big).$
Note that $\delta$ is orthogonal to all linear functions of the data; because this error is Gaussian, it is moreover independent of all functions of the data. It follows that $E[(Y - \widehat{Y})\, q(\text{data})] = 0$ for any square integrable function $q$ of the data, because
$E[(Y - \widehat{Y})\, q(\text{data})] = E\big[ q(\text{data})\, E[Y - \widehat{Y} \mid \text{data}] \big] = 0.$
Moreover, for any function $q$ of the data,
$E[(Y - q)^2] = E[(Y - \widehat{Y})^2] + E[(\widehat{Y} - q)^2] \ge E[(Y - \widehat{Y})^2].$
This verifies optimality. To compute the MSE, first observe that
$E[Y^2] = \sum_{k \ge 0} g_k^2,$
by the orthonormality of the $H_k$. Note that $E[\widehat{Z}^2] = 1 - V$, because $Z$ is standard normal and $Z = \widehat{Z} + \delta$ with the two summands orthogonal; hence $\widehat{Z}/\sqrt{W}$ is standard normal, where $W = 1 - V$. Using these facts and again using the independence property, we take the conditional expectation of the generating function (4) and obtain
$E\big[ \exp\{tZ - t^2/2\} \mid \text{data} \big] = \exp\{t\widehat{Z} - t^2 W/2\} = \exp\big\{ (t\sqrt{W})\,(\widehat{Z}/\sqrt{W}) - (t\sqrt{W})^2/2 \big\},$
so that matching coefficients of $t^k$ yields $E[H_k(Z) \mid \text{data}] = W^{k/2}\, H_k(\widehat{Z}/\sqrt{W})$. Since $\widehat{Z}/\sqrt{W}$ is standard normal and the $H_k$ are orthonormal,
$E[\widehat{Y}^2] = \sum_{k \ge 0} g_k^2\, W^k,$
and hence
$E[(Y - \widehat{Y})^2] = E[Y^2] - E[\widehat{Y}^2] = \sum_{k \ge 1} g_k^2\, (1 - W^k),$
which is (7).