Nonlinear prediction via Hermite transformation

Tucker McElroy & Srinjoy Das
Pages 49-54 | Received 25 Oct 2019, Accepted 24 Nov 2020, Published online: 17 Dec 2020

ABSTRACT

General prediction formulas involving Hermite polynomials are developed for time series expressed as a transformation of a Gaussian process. The prediction gains over linear predictors are examined numerically, demonstrating the improvement of nonlinear prediction.

1. Introduction

The general prediction problem is to compute the best predictor of a random variable $Y$ given a data vector $\underline{X}$, where a joint distribution is presumed to exist for $Y$ and $\underline{X}$. If we define 'best' according to mean squared error (MSE) loss,$^1$ the best predictor (when the random variables are square integrable) is the conditional expectation $E[Y \mid \underline{X}]$, which in the case of Gaussian variables is a linear function of $\underline{X}$. This linear function is completely computable in terms of the first and second moments of the joint vector $(Y, \underline{X})$, as discussed in Brockwell and Davis (2013, Chapter 2). The problem can also be generalised to projection on infinite data sets, which arise in forecasting and signal extraction problems.

The theory for linear predictors is very well understood and is commonly applied to non-Gaussian data because it is simple to compute. Nevertheless, there can be a substantial predictive loss when non-Gaussian features are present in the data, such as asymmetry and excess kurtosis (Brockett et al., 1988; Maravall, 1983). A common technique for handling such raw time series data is to apply a transformation that reduces asymmetry and kurtosis, thereby generating cumulants in the transformed space that more closely resemble Gaussian cumulants. Box-Cox transforms are an example of such functions, and are typically identified through exploratory analysis or via metadata; see discussion in McElroy (2016).

This paper provides exact formulas for nonlinear prediction in scenarios where the non-Gaussian data process can be expressed as a univariate transformation of some Gaussian process. For some special cases, such as the log-normal distribution, exact formulas are already available for nonlinear predictors; here the general case is developed. The main result of the paper (Section 2) is an expansion of the conditional expectation in terms of Hermite coefficients of the transformation function, an idea that was utilised in Janicki and McElroy (2016) to model marginal quantiles. Here this technique is used to derive analytical expressions for predictors, along with the MSE; the solution is given as an explicit function of the Hermite coefficients of the transformation function, the various Hermite polynomials evaluated at the linear predictor, and further weights explicitly determined from the mean squared error of the linear predictor.

These results are general, in the sense that they can be applied to diverse contexts in statistics, such as linear models, spatial statistics, and multivariate analysis. But our applications are focussed on time series, and in particular on forecasting problems. In forecasting, the data vector $\underline{X} = [X_1, \ldots, X_T]$ is a sample of size $T$ from a time series $\{X_t\}$, and $Y = X_{T+1}$ represents the next unobserved value of the process. Backcasting involves setting $Y = X_0$, and missing value problems can similarly be addressed by letting $Y = X_t$ and $\underline{X} = [X_1, \ldots, X_{t-1}, X_{t+1}, \ldots, X_T]$; see McElroy and McCracken (2017) for a recent treatment. These facets of the general methodology are developed in Section 2, and numerical comparisons are given in Section 3. The proofs are in an Appendix.

2. Nonlinear prediction

Our goal is to compute the minimal mean squared error (MSE) estimate of $Y$ given data $\underline{X} = [X_1, \ldots, X_T]$ (finite-sample case) or $\underline{X} = \{X_s : s \le T\}$ (semi-infinite sample case). This estimate is the conditional expectation $E[Y \mid \underline{X}]$, denoted $\widehat{Y}$ for short. We presume that there exists an invertible function $f$ such that $Z_t = f(X_t)$ yields a Gaussian process $\{Z_t\}$. Moreover, $f(Y)$ is a Gaussian random variable whose joint distribution with the process $\{Z_t\}$ is known. In the case of forecasting, backcasting, or missing value imputation, $f(Y) = Z_t$ for some $t$.

In the following development, it is important to impose that $f(Y)$ be standard normal, although this would rarely be the case if $f$ is obtained by exploratory analysis. (For example, if $Y = X_{T+1}$ and $f(x) = \log(x)$ by exploratory analysis, it would rarely be the case that $\log(X_{T+1})$ would have unit variance.) Let $Z$ denote the standardisation of $f(Y)$, i.e., $Z = (f(Y) - E[f(Y)]) / \sqrt{\mathrm{Var}[f(Y)]}$. We will define $g$ as the inverse map from $Z$ to $Y$, so that

$g(x) = f^{-1}\big( x \sqrt{\mathrm{Var}[f(Y)]} + E[f(Y)] \big)$.

The mapping $Y = g(Z)$ allows us to obtain a Hermite expansion of $g$; if the marginal distribution of $Z$ were non-Gaussian, we could instead have recourse to the Appell polynomials (Varma, 1951).

The main result of the paper is an expression for $E[Y \mid \underline{X}]$ in terms of $E[Z \mid \underline{Z}]$, with $\underline{Z} = [Z_1, \ldots, Z_T]$. This is of interest because such a Gaussian conditional expectation has a well-known linear formula; see Chapter 4 of McElroy and Politis (2020). In applications, one might transform the data by applying $f$, then model the Gaussian process, and finally compute $E[Z \mid \underline{Z}]$ by plugging into the linear formulas. Then the main formula of this paper can be used to obtain $E[Y \mid \underline{X}]$ by inserting $E[Z \mid \underline{Z}]$ and its MSE, denoted by $V$, which would also be available in applications. In particular,

(1) $V = E\big[ (Z - E[Z \mid \underline{Z}])^2 \big]$

and is given by formulas in McElroy and Politis (2020); also, see (11) below. We define the space $L_2(d\Phi)$ (where $\Phi$ is the standard normal cumulative distribution function) as all functions that are square integrable with respect to the measure $d\Phi$. An inner product on this space is defined via $\langle f, h \rangle = \int f(x)\, h(x)\, \phi(x)\, dx$, where $\phi$ is the standard normal probability density function. Hence, we can also write $\langle f, g \rangle = E[f(W)\, g(W)]$ for $W \sim N(0,1)$. It follows that we can do a Hermite expansion of $g$ (see Janicki & McElroy, 2016, for background):

(2) $g(x) = \sum_{k=0}^{\infty} J_k H_k(x)$

with $H_k$ the normalised Hermite polynomials (Roman, 1984), and $J_k = \langle g, H_k \rangle$ the Hermite coefficients. The Hermite polynomials are defined via

$H_k(x) = \frac{(-1)^k}{\sqrt{k!}}\, e^{x^2/2}\, \frac{d^k}{dx^k} e^{-x^2/2}$

for $k \ge 0$, and form a complete orthonormal system. (Hence, the coefficients $J_k$ tend to zero as $k \to \infty$.) Plugging $Z$ into (2) yields $Y = \sum_{k=0}^{\infty} J_k H_k(Z)$, and applying the conditional expectation operator (which is linear) yields

(3) $\widehat{Y} = E[Y \mid \underline{X}] = \sum_{k=0}^{\infty} J_k\, E[H_k(Z) \mid \underline{X}]$.

Evidently, the nonlinear predictor can be computed in terms of conditional expectations of $H_k(Z)$, and the formula is summarised in our main theorem below. The proof relies upon the Hermite generating function

(4) $h(x,t) = \exp\{xt - t^2/2\} = \sum_{k=0}^{\infty} \frac{t^k}{\sqrt{k!}}\, H_k(x)$.

(The second equality follows from the definition of the Hermite polynomials, and is discussed in Roman (1984).) From (4) we see that

(5) $\partial_t^k\, h(x,t) \big|_{t=0} = \sqrt{k!}\, H_k(x)$.

Therefore, $J_k = E[g(W) H_k(W)] = \frac{1}{\sqrt{k!}}\, \partial_t^k E[g(W)\, h(W,t)] \big|_{t=0}$. We next discuss the optimal predictor.
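
To make the objects above concrete, the following minimal sketch (our own illustrative code, not part of the paper) evaluates the normalised Hermite polynomials through the probabilists' three-term recurrence and checks their orthonormality under the standard normal weight by Monte Carlo; numpy is assumed, and the helper name hermite_normalized is hypothetical.

```python
import numpy as np
from math import factorial

def hermite_normalized(x, kmax):
    """Evaluate the normalised Hermite polynomials H_0(x), ..., H_kmax(x).

    Uses the probabilists' recurrence He_{k+1}(x) = x He_k(x) - k He_{k-1}(x)
    and then divides by sqrt(k!), matching the normalisation in (2)-(5).
    Returns an array of shape (kmax + 1, len(x))."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    He = np.zeros((kmax + 1, x.size))
    He[0] = 1.0
    if kmax >= 1:
        He[1] = x
    for k in range(1, kmax):
        He[k + 1] = x * He[k] - k * He[k - 1]
    norms = np.sqrt([float(factorial(k)) for k in range(kmax + 1)])
    return He / norms[:, None]

# Monte Carlo check of orthonormality: E[H_j(W) H_k(W)] should be close to
# the identity matrix for W ~ N(0, 1)
rng = np.random.default_rng(0)
W = rng.standard_normal(500_000)
H = hermite_normalized(W, 3)
print(np.round(H @ H.T / W.size, 2))   # approximately the 4 x 4 identity
```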

Theorem 2.1

Suppose that $Z$ is standard normal, and $Y = g(Z)$ with $g$ given by (2), and $\widehat{Z} = E[Z \mid \underline{Z}]$ is the linear prediction in the transformed space, with MSE given by $V$ in (1). Then the MSE optimal nonlinear predictor $\widehat{Y}$ of $Y$ given the data $\underline{X}$ is

(6) $E[Y \mid \underline{X}] = \sum_{k=0}^{\infty} \frac{J_k}{\sqrt{k!}} \sum_{\ell=0}^{k} \binom{k}{\ell} \sqrt{\ell!}\; H_\ell(\widehat{Z})\, \kappa_{k-\ell}$,

where $\kappa_\ell$ is the $\ell$th moment of a Gaussian variable of variance $V$: $\kappa_\ell$ equals 1 if $\ell = 0$, equals zero if $\ell$ is odd, and equals $V^{\ell/2} (\ell - 1)!!$ if $\ell$ is even. With $\varepsilon = Y - \widehat{Y}$, the optimal prediction MSE is

(7) $E[\varepsilon^2] = \sum_{k=1}^{\infty} J_k^2 \big( 1 - (1 - V)^k \big)$.

The optimal predictor (6) can be explicitly computed to any desired level of accuracy by truncating the first summation over $k$ at a suitable level. Note that this result applies to finite samples, semi-infinite samples, and bi-infinite samples, allowing us to apply (6) to forecasting (from an infinite past) and missing value problems.
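
To make the truncation concrete, here is a minimal sketch (our own code, with hypothetical helper names) of the truncated versions of (6) and (7), assuming numpy is available; it uses numpy's probabilists' Hermite evaluator and is checked against Example 2.5 below, whose predictor and MSE are available in closed form.

```python
import numpy as np
from math import comb, factorial
from numpy.polynomial.hermite_e import hermeval

def H(l, x):
    """Normalised Hermite polynomial H_l(x) = He_l(x) / sqrt(l!)."""
    return hermeval(x, [0.0] * l + [1.0]) / np.sqrt(factorial(l))

def kappa(l, V):
    """l-th moment of a N(0, V) variable: zero for odd l, and
    V**(l/2) * (l - 1)!! for even l (so kappa(0, V) = 1)."""
    if l % 2 == 1:
        return 0.0
    return V ** (l // 2) * float(np.prod(np.arange(l - 1, 0, -2)))

def nonlinear_predictor(J, Z_hat, V):
    """Formula (6), truncated at k = len(J) - 1."""
    return sum(
        J[k] / np.sqrt(factorial(k))
        * sum(comb(k, l) * np.sqrt(factorial(l)) * H(l, Z_hat) * kappa(k - l, V)
              for l in range(k + 1))
        for k in range(len(J)))

def prediction_mse(J, V):
    """Formula (7), truncated at k = len(J) - 1."""
    return sum(J[k] ** 2 * (1.0 - (1.0 - V) ** k) for k in range(1, len(J)))

# sanity check against Example 2.5: g(x) = x^2 has J = (1, 0, sqrt(2)),
# predictor Z_hat^2 + V, and MSE 4V - 2V^2
J, Z_hat, V = [1.0, 0.0, np.sqrt(2.0)], 0.7, 0.3
print(nonlinear_predictor(J, Z_hat, V), Z_hat ** 2 + V)   # both approximately 0.79
print(prediction_mse(J, V), 4 * V - 2 * V ** 2)           # both approximately 1.02
```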

Remark 2.1

Since $\sum_{k=0}^{\infty} J_k^2 = \langle g, g \rangle$, we see that the Hermite coefficients are square summable if $g \in L_2(d\Phi)$, and the MSE is well-defined, being bounded above by $\sum_{k=1}^{\infty} J_k^2\, V k$. (Note that $V \le 1$ because $Z$ has variance one.) Because $\mathrm{Var}[\widehat{Y}] = \mathrm{Var}[Y] - E[\varepsilon^2]$, we see that $\widehat{Y}$ is square integrable, and hence the formula (6) converges.

Remark 2.2

In applications, we typically will know the distribution of $f(Y) \mid \underline{Z}$ rather than that of $Z \mid \underline{Z}$, but the latter can be obtained from the former by applying the normalising transform described above, i.e.,

$E[Z \mid \underline{Z}] = \big( E[f(Y) \mid \underline{Z}] - E[f(Y)] \big) \big/ \sqrt{\mathrm{Var}[f(Y)]}, \qquad V = E\big[ (f(Y) - E[f(Y) \mid \underline{Z}])^2 \big] \big/ \mathrm{Var}[f(Y)]$.
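
As a tiny sketch of this conversion (hypothetical function name, numpy assumed):

```python
import numpy as np

def standardise(pred_fY, mse_fY, mean_fY, var_fY):
    """Map E[f(Y) | Z_] and its MSE to the standardised Z_hat and V of Remark 2.2."""
    Z_hat = (pred_fY - mean_fY) / np.sqrt(var_fY)
    V = mse_fY / var_fY
    return Z_hat, V

# e.g. if f(Y) has mean 1.2 and variance 4.0, and its prediction is 2.0 with MSE 1.0
print(standardise(2.0, 1.0, 1.2, 4.0))   # Z_hat = 0.4, V = 0.25
```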

Remark 2.3

To apply Theorem 2.1 we must obtain the Hermite coefficients $J_k$. Either $g$ is known (obtained by exploratory analysis) or estimated (see Janicki & McElroy, 2016), and then the Hermite coefficients can be obtained via Monte Carlo simulation:

$J_k = \langle g, H_k \rangle = E[g(W) H_k(W)] \approx M^{-1} \sum_{m=1}^{M} g(W_m) H_k(W_m)$

for $W_1, \ldots, W_M$ i.i.d. standard normal. Another approach to computation involves the generating function:

$J_k = \frac{1}{\sqrt{k!}}\, \partial_t^k E[g(W)\, h(W,t)] \big|_{t=0} = \frac{1}{\sqrt{2\pi}\sqrt{k!}}\, \partial_t^k \int g(w) \exp\{wt - t^2/2\} \exp\{-w^2/2\}\, dw \Big|_{t=0} = \frac{1}{\sqrt{2\pi}\sqrt{k!}}\, \partial_t^k \int g(w) \exp\{-(w-t)^2/2\}\, dw \Big|_{t=0} = \frac{1}{\sqrt{2\pi}\sqrt{k!}}\, \partial_t^k \int g(w+t) \exp\{-w^2/2\}\, dw \Big|_{t=0} = \frac{1}{\sqrt{k!}}\, \partial_t^k E[g(W+t)] \big|_{t=0} = \frac{1}{\sqrt{k!}}\, E[g^{(k)}(W)]$.

Here $g^{(k)}$ denotes the $k$-fold derivative of the function $g$. In cases where $g$ is explicitly known, this can be an easier approach to obtaining the Hermite coefficients.
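
The Monte Carlo approach is easy to code; the sketch below (our own code, with a hypothetical function name) uses numpy's probabilists' Hermite evaluator, normalised by $\sqrt{k!}$ as above, and can be compared against the values reported for the logistic transform in Example 2.4 below.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval

def hermite_coefficients_mc(g, kmax, n_draws=500_000, seed=0):
    """Monte Carlo estimate of J_k = E[g(W) H_k(W)] for k = 0, ..., kmax,
    with W standard normal and H_k the normalised Hermite polynomials."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal(n_draws)
    gW = g(W)
    return np.array([np.mean(gW * hermeval(W, [0.0] * k + [1.0]))
                     / np.sqrt(factorial(k)) for k in range(kmax + 1)])

# logistic transform of Example 2.4: roughly (0.500, 0.207, 0.000, -0.025)
print(np.round(hermite_coefficients_mc(lambda x: 1.0 / (1.0 + np.exp(-x)), 3), 3))
```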

In the following examples we suppose that $\widehat{Z} = E[Z \mid \underline{Z}]$ and $V = E[(\widehat{Z} - Z)^2]$ are known and available to the practitioner; we present various cases of transforms $f$, and apply the results of Theorem 2.1.

Example 2.1

Gaussian

A simple affine transformation $g(x) = \sigma x + \mu$ ensures that $Y$ is still Gaussian, with mean $\mu$ and variance $\sigma^2$. In this case $J_0 = \mu$ and $J_1 = \sigma$, and we more simply have $E[Y \mid \underline{X}] = \sigma \widehat{Z} + \mu$.

Example 2.2

Lognormal

Suppose $g(x) = e^x$; applying the generating function method of Remark 2.3, we obtain

$J_k = \frac{1}{\sqrt{k!}}\, E[\exp\{W\}] = \frac{e^{1/2}}{\sqrt{k!}}$

for all $k \ge 0$. This can be utilised in (3), together with (5), to yield

$E[Y \mid \underline{X}] = e^{1/2} \sum_{k=0}^{\infty} \frac{1}{k!}\, \partial_t^k \big( h(\widehat{Z}, t)\, e^{V t^2/2} \big) \Big|_{t=0} = e^{1/2} \big( h(\widehat{Z}, t)\, e^{V t^2/2} \big) \Big|_{t=1} = e^{1/2} \exp\{\widehat{Z} + (V - 1)/2\} = \exp\{\widehat{Z} + V/2\}$,

which corresponds to the result of McElroy (2010). Applying Theorem 2.1 to compute the optimal MSE, we see that $E[(\widehat{Y} - Y)^2] = e^2 (1 - e^{-V})$.
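
A quick simulation sanity check of this example (a sketch assuming $\widehat{Z} \sim N(0, 1 - V)$ and $Z \mid \widehat{Z} \sim N(\widehat{Z}, V)$; the value of $V$ is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
V, n = 0.4, 1_000_000
Z_hat = np.sqrt(1 - V) * rng.standard_normal(n)   # Var[Z_hat] = 1 - V
Z = Z_hat + np.sqrt(V) * rng.standard_normal(n)   # Z | Z_hat ~ N(Z_hat, V)

pred = np.exp(Z_hat + V / 2)                      # optimal predictor of Y = exp(Z)
print(np.mean((np.exp(Z) - pred) ** 2),           # simulated MSE ...
      np.exp(2) * (1 - np.exp(-V)))               # ... versus e^2 (1 - e^{-V}), about 2.44
```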

Example 2.3

Uniform

For $\{Z_t\}$ with standard normal marginal, set $g = \Phi$, so that $\{X_t\}$ has a marginal distribution that is uniform on $(0,1)$. Then by Remark 2.3, we find that $g^{(k)}(x) = \phi^{(k-1)}(x)$ for $k \ge 1$, and hence

$J_k = \frac{1}{\sqrt{k!}}\, E[\phi^{(k-1)}(W)] = k^{-1/2} (-1)^{k-1} E[H_{k-1}(W)\, \phi(W)]$,

where $W \sim N(0,1)$.

Example 2.4

Logistic

Consider a logistic transform given by $g(x) = e^x / (1 + e^x)$. The first few derivatives of $g$ are

$g^{(1)}(x) = \frac{e^x}{(1 + e^x)^2}, \qquad g^{(2)}(x) = \frac{e^x - e^{2x}}{(1 + e^x)^3}, \qquad g^{(3)}(x) = \frac{e^x - 4 e^{2x} + e^{3x}}{(1 + e^x)^4}$.

By Monte Carlo, we obtain $J_0 = 0.500$, $J_1 = 0.207$, $J_2 = 0$, and $J_3 = -0.025$.

Example 2.5

Square

With $g(x) = x^2$, we have $J_0 = 1$, $J_1 = 0$, $J_2 = \sqrt{2}$, and $J_k = 0$ for $k > 2$. Then the optimal nonlinear predictor is $E[Y \mid \underline{X}] = \widehat{Z}^2 + V$. It is simple to check that the error is

$\varepsilon = (Z - \widehat{Z})(Z + \widehat{Z}) - V$,

which has mean zero and is uncorrelated with all functions of the data. The prediction MSE is $4V - 2V^2$.
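
A small simulation sketch (with an arbitrary illustrative $V$) confirming that this error has mean zero, is uncorrelated with functions of the data such as $\widehat{Z}$ and $\widehat{Z}^2$, and has mean square $4V - 2V^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
V, n = 0.3, 1_000_000
Z_hat = np.sqrt(1 - V) * rng.standard_normal(n)   # Var[Z_hat] = 1 - V
Z = Z_hat + np.sqrt(V) * rng.standard_normal(n)   # Z | Z_hat ~ N(Z_hat, V)
eps = (Z - Z_hat) * (Z + Z_hat) - V
print(np.round([np.mean(eps),                     # ~ 0
                np.cov(eps, Z_hat)[0, 1],         # ~ 0
                np.cov(eps, Z_hat ** 2)[0, 1],    # ~ 0
                np.mean(eps ** 2),                # ~ 1.02
                4 * V - 2 * V ** 2], 3))          # = 1.02
```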

3. Comparing linear and nonlinear prediction

It is of interest to understand how much benefit nonlinear prediction provides. Clearly, if $g$ is affine (Example 2.1) then the minimal MSE is equal to $\sigma^2 V$, the same as that of the linear predictor, but we can expect gains to the degree that $g$ differs from the affine case.

Remark 3.1

A related nonlinear predictor that is sometimes used in applications is defined via $\widetilde{Y} = g(\widehat{Z})$, but unfortunately this estimator can be biased. Following the same arguments used in the proof of Theorem 2.1,

$Y - \widetilde{Y} = \sum_{k=0}^{\infty} \frac{J_k}{\sqrt{k!}}\, \partial_t^k \big( h(Z,t) - h(\widehat{Z},t) \big) \Big|_{t=0} = \sum_{k=0}^{\infty} \frac{J_k}{\sqrt{k!}}\, \partial_t^k \Big( \exp\{\widehat{Z} t - t^2/2\} \big[ \exp\{(Z - \widehat{Z}) t\} - 1 \big] \Big) \Big|_{t=0}$,

so that the expectation of the quantity in parentheses is

$E\big[ \exp\{\widehat{Z} t - t^2/2\} \big] \big[ \exp\{V t^2/2\} - 1 \big] > 0$

for $t \neq 0$. Hence there is no guarantee that the bias is zero.
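
For a concrete instance of the bias, take the lognormal case of Example 2.2: there $E[g(\widehat{Z})] = e^{(1-V)/2}$, which falls short of $E[Y] = e^{1/2}$ whenever $V > 0$. A small simulation sketch (with an arbitrary illustrative $V$):

```python
import numpy as np

rng = np.random.default_rng(3)
V, n = 0.4, 1_000_000
Z_hat = np.sqrt(1 - V) * rng.standard_normal(n)   # Z_hat ~ N(0, 1 - V)
print(np.mean(np.exp(Z_hat)),                     # ~ exp((1 - V)/2) = 1.35
      np.exp((1 - V) / 2),                        # 1.35
      np.exp(0.5))                                # E[Y] = 1.65, so g(Z_hat) is biased low
```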

More properly, a comparison can be made to the best linear predictor. When $\{Z_t\}$ is a strictly stationary time series, then $\{X_t\}$ is as well, and we can in some cases determine the best linear estimator's MSE for comparison. Note that in this special case, the mean and variance of each variable $Z_t$ are the same, and hence without any loss of generality we may assume that $\{Z_t\}$ is standardised, i.e., each $Z_t$ is standard normal. Because $X_t = g(Z_t)$, it follows from Taniguchi and Kakizawa (2000, p. 319) that $E[X_t] = J_0$ and

(8) $\gamma_X(h) = \sum_{k=0}^{\infty} J_k^2\, \gamma_Z(h)^k - J_0^2 = \sum_{k=1}^{\infty} J_k^2\, \gamma_Z(h)^k$,

where $\{\gamma_X(h)\}$ and $\{\gamma_Z(h)\}$ are the autocovariance sequences of $\{X_t\}$ and $\{Z_t\}$, respectively. So in principle we can understand the second order structure of $\{X_t\}$ in terms of the Hermite coefficients and the original autocovariances.

Suppose further that we are interested in one-step ahead forecasting from a sample of size $T$. The best linear predictor is obtained by solving the Yule-Walker equations in $\{\gamma_X(h)\}$, and its MSE is given by

(9) $\gamma_X(0) - [\gamma_X(1), \ldots, \gamma_X(T)]\, \Gamma_X^{-1}\, [\gamma_X(1), \ldots, \gamma_X(T)]'$,

where $\Gamma_X$ is the $T$-dimensional Toeplitz covariance matrix with $jk$th entry $\gamma_X(j-k)$. We know that such an MSE must be greater than the minimal MSE provided in Theorem 2.1, with equality occurring only in the case that a linear estimator is globally optimal (e.g., the time series is linear, or is Gaussian). If instead we are forecasting from an infinite past, then the MSE of the linear predictor is the innovation variance $\sigma^2$ given by Kolmogorov's formula:

(10) $\sigma^2 = \exp\Big\{ (2\pi)^{-1} \int_{-\pi}^{\pi} \log f_X(\lambda)\, d\lambda \Big\}$.

Here, $f_X(\lambda) = \sum_{h=-\infty}^{\infty} \gamma_X(h)\, e^{-i \lambda h}$ is the spectral density of $\{X_t\}$. Therefore, for either a finite sample or for an infinite past, we can determine the linear predictor MSE for a stationary process $\{X_t\}$ by first computing $\gamma_X(h)$ from $\gamma_Z(h)$ via (8), followed by application of (9) or (10) as each case requires. As for the best (nonlinear) predictor, its MSE is given by (7) of Theorem 2.1, where

(11) $V = \gamma_Z(0) - [\gamma_Z(1), \ldots, \gamma_Z(T)]\, \Gamma_Z^{-1}\, [\gamma_Z(1), \ldots, \gamma_Z(T)]'$

is the analogue of (9) for the $\{Z_t\}$ process.
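
The mapping (8) and the finite-sample MSE formulas (9) and (11) are straightforward to code; the sketch below (our own helper names, assuming numpy and scipy are available) is reused in the illustration that follows.

```python
import numpy as np
from scipy.linalg import toeplitz

def acvf_X_from_Z(gamma_Z, J):
    """Formula (8): gamma_X(h) = sum_{k >= 1} J_k^2 * gamma_Z(h)^k."""
    gamma_Z = np.asarray(gamma_Z, dtype=float)
    return sum(J[k] ** 2 * gamma_Z ** k for k in range(1, len(J)))

def one_step_mse(gamma, T):
    """Formulas (9)/(11): gamma(0) - g' Gamma^{-1} g for one-step-ahead
    prediction from a sample of size T, with gamma = [gamma(0), ..., gamma(T)]."""
    Gamma = toeplitz(gamma[:T])          # T x T Toeplitz covariance matrix
    g = np.asarray(gamma[1:T + 1])
    return gamma[0] - g @ np.linalg.solve(Gamma, g)
```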

We provide an illustration in the case of an MA(1) process with various values of $\theta$, and sample size $T = 100$. The innovation variance is set equal to $(1 + \theta^2)^{-1}$ so that $\gamma_Z(0) = 1$, as required by the above discussion. For the MA(1) process with $T = 100$, the value of $V$ given by (11) is the same (up to the fourth decimal place) as the innovation variance $(1 + \theta^2)^{-1}$. For transformations, we study $g(x) = x^2$, $g(x) = e^x$, and the logistic. Observe from (8) that the process $\{X_t\}$ will be $m$-dependent if $\{Z_t\}$ is (although the converse need not be true). Hence, if we obtained a sample from $\{X_t\}$ it would likely be identified with an MA($q$) model, and the parameter estimates (e.g., obtained using a Whittle likelihood, which is valid for non-Gaussian processes so long as the cumulants are summable) would likely converge to those corresponding to the spectral factorisation of $f_X$. Thus, our illustration provides an accurate rendition of the prediction MSE one would obtain in the case of linear or nonlinear predictors, only with the impact of parameter estimation error completely removed (Tables 1–3).

Table 1. MSE for linear and non-linear predictors applied to a squared MA(1) process of parameter θ.

Table 2. MSE for linear and non-linear predictors applied to an exponential MA(1) process of parameter θ.

Table 3. MSE for linear and non-linear predictors applied to a logistic MA(1) process of parameter θ.
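
A sketch of how such a comparison can be assembled, reusing prediction_mse, acvf_X_from_Z, and one_step_mse from the earlier snippets and taking $g(x) = x^2$ (Example 2.5); the grid of $\theta$ values is ours, and the printed numbers are illustrative rather than a reproduction of the tables.

```python
import numpy as np

T = 100
for theta in (0.2, 0.5, 0.8):
    sigma2 = 1.0 / (1.0 + theta ** 2)              # innovation variance, so gamma_Z(0) = 1
    gamma_Z = np.zeros(T + 1)
    gamma_Z[0], gamma_Z[1] = 1.0, theta * sigma2   # MA(1) autocovariances

    J = [1.0, 0.0, np.sqrt(2.0)]                   # Hermite coefficients of g(x) = x^2
    gamma_X = acvf_X_from_Z(gamma_Z, J)            # formula (8)

    V = one_step_mse(gamma_Z, T)                   # formula (11)
    print(theta,
          one_step_mse(gamma_X, T),                # linear MSE, formula (9)
          prediction_mse(J, V))                    # nonlinear MSE, formula (7)
```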

In each case, the degree of benefit to nonlinear prediction increases with $\theta$, as is to be expected; however, there are large discrepancies between the three functions. When $\theta = 0.8$, the logistic transformation offers only a 3% improvement with nonlinear prediction, indicating that linear prediction is almost as good as the conditional expectation. With $g(x) = x^2$ the analogous improvement is 11%, and it is 16% for $g(x) = e^x$, indicating some real benefit to nonlinear prediction.

We end with a remark on how a confidence interval can be constructed using simulations. Let the formula given by (6) be denoted $h(\widehat{Z})$. Then the optimal prediction error can be written

$\varepsilon = g(\widehat{Z} + \delta) - h(\widehat{Z})$,

where $\delta = Z - \widehat{Z}$ is the (linear) prediction error for the Gaussian variable, and is uncorrelated with (and hence independent of) $\widehat{Z}$. Also $\delta \sim N(0, V)$. Therefore, in cases where it is easy to simulate $\widehat{Z}$ (e.g., suppose we are forecasting from a Gaussian ARMA process) we can independently draw $\delta$ and compute $\varepsilon$ for repeated Monte Carlo draws, thereby obtaining a confidence interval for $Y$.
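
One way to implement this suggestion, conditional on the observed $\widehat{Z}$, is sketched below (the function name and the illustrative lognormal inputs are ours):

```python
import numpy as np

def prediction_interval(g, Z_hat, V, level=0.95, n_draws=100_000, seed=4):
    """Monte Carlo interval for Y = g(Z) given the observed Z_hat: draw
    delta ~ N(0, V), form g(Z_hat + delta), and take the central quantiles."""
    rng = np.random.default_rng(seed)
    draws = g(Z_hat + np.sqrt(V) * rng.standard_normal(n_draws))
    return np.quantile(draws, [(1 - level) / 2, (1 + level) / 2])

# e.g. a lognormal series (Example 2.2) with Z_hat = 0.3 and V = 0.4
print(prediction_interval(np.exp, 0.3, 0.4))
```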

Disclaimer

This report is released to inform interested parties of research and to encourage discussion. The views expressed on statistical issues are those of the authors and not those of the U.S. Census Bureau.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Notes on contributors

Tucker McElroy

Tucker McElroy is Senior Time Series Mathematical Statistician at the U.S. Census Bureau.

Srinjoy Das

Srinjoy Das is a Postdoctoral Scholar in the Mathematics department at the University of California, San Diego.

Notes

1 Other loss functions could of course be envisioned (mean absolute loss yields the conditional median, for example).

References

  • Brockett, P. L., Hinich, M. J., & Patterson, D. (1988). Bispectral-based tests for the detection of Gaussianity and linearity in time series. Journal of the American Statistical Association, 83(403), 657–664. https://doi.org/10.1080/01621459.1988.10478645
  • Brockwell, P. J., & Davis, R. A. (2013). Time series: Theory and methods. Springer Science & Business Media.
  • Janicki, R., & McElroy, T. (2016). Hermite expansion and estimation of monotonic transformations of Gaussian data. Journal of Nonparametric Statistics, 28(1), 207–234. https://doi.org/10.1080/10485252.2016.1139880
  • Maravall, A. (1983). An application of nonlinear time series forecasting. Journal of Business & Economic Statistics, 1(1), 66–74. https://doi.org/10.1080/07350015.1983.10509325
  • McElroy, T. (2010). A nonlinear algorithm for seasonal adjustment in multiplicative component decompositions. Studies in Nonlinear Dynamics and Econometrics, 14(4). Article 6. https://doi.org/10.2202/1558-3708.1756
  • McElroy, T. (2016). On the measurement and treatment of extremes in time series. Extremes, 19(3), 467–490. https://doi.org/10.1007/s10687-016-0254-4
  • McElroy, T., & McCracken, M. (2017). Multi-step ahead forecasting of vector time series. Econometric Reviews, 36(5), 495–513. https://doi.org/10.1080/07474938.2014.977088
  • McElroy, T., & Politis, D. (2020). Time series: A first course with bootstrap starter. Chapman & Hall/CRC Press.
  • Roman, S. (1984). The umbral calculus. Academic Press.
  • Taniguchi, M., & Kakizawa, Y. (2000). Asymptotic theory of statistical inference for time series. Springer.
  • Varma, R. S. (1951). On Appell polynomials. Proceedings of the American Mathematical Society, 2(4), 593–596. https://doi.org/10.1090/S0002-9939-1951-0042547-5

Appendix

Proof of Theorem 2.1

From (3) and (5) we obtain

$E[H_k(Z) \mid \underline{X}] = \frac{1}{\sqrt{k!}}\, \partial_t^k E[h(Z,t) \mid \underline{X}] \big|_{t=0}$.

We can write $Z \mid \underline{X} \sim N(\widehat{Z}, V)$, and using the property that $Z - \widehat{Z}$ is independent of all functions of the data, we obtain

$E[h(Z,t) \mid \underline{X}] = E\big[ \exp\{Z t - \widehat{Z} t\} \exp\{\widehat{Z} t - t^2/2\} \,\big|\, \underline{X} \big] = E\big[ \exp\{Z t - \widehat{Z} t\} \,\big|\, \underline{X} \big] \exp\{\widehat{Z} t - t^2/2\} = \exp\{V t^2/2\}\, h(\widehat{Z}, t)$.

Hence

$E[H_k(Z) \mid \underline{X}] = \frac{1}{\sqrt{k!}}\, \partial_t^k \big( h(\widehat{Z}, t)\, e^{V t^2/2} \big) \Big|_{t=0} = \frac{1}{\sqrt{k!}} \sum_{\ell=0}^{k} \binom{k}{\ell}\, \partial_t^{\ell} h(\widehat{Z}, t)\, \partial_t^{k-\ell} e^{V t^2/2} \Big|_{t=0} = \frac{1}{\sqrt{k!}} \sum_{\ell=0}^{k} \binom{k}{\ell} \sqrt{\ell!}\; H_\ell(\widehat{Z})\, \kappa_{k-\ell}$.

The prediction error is

$\varepsilon = Y - E[Y \mid \underline{X}] = \sum_{k=0}^{\infty} J_k \big( H_k(Z) - E[H_k(Z) \mid \underline{X}] \big) = \sum_{k=0}^{\infty} \frac{J_k}{\sqrt{k!}}\, \partial_t^k \big( h(Z,t) - E[h(Z,t) \mid \underline{X}] \big) \Big|_{t=0} = \sum_{k=0}^{\infty} \frac{J_k}{\sqrt{k!}}\, \partial_t^k \Big( h(\widehat{Z}, t) \big[ \exp\{(Z - \widehat{Z}) t\} - \exp\{V t^2/2\} \big] \Big) \Big|_{t=0}$.

Note that $Z - \widehat{Z}$ is orthogonal to all linear functions of the data; because this error is Gaussian, it is moreover independent of all functions of the data $\underline{X}$. It follows that $E[\varepsilon] = 0$, because $E[\exp\{(Z - \widehat{Z}) t\}] = \exp\{V t^2/2\}$. Moreover, for any function $\ell(\underline{X})$ of the data,

$E[\varepsilon\, \ell(\underline{X})] = \sum_{k=0}^{\infty} \frac{J_k}{\sqrt{k!}}\, \partial_t^k \Big( E\big[ h(\widehat{Z}, t)\, \ell(\underline{X}) \big]\, E\big[ \exp\{(Z - \widehat{Z}) t\} - \exp\{V t^2/2\} \big] \Big) \Big|_{t=0} = 0$.

This verifies optimality. To compute the MSE, first observe that

$\varepsilon^2 = \sum_{j,k=1}^{\infty} \frac{J_j J_k}{\sqrt{j!}\sqrt{k!}}\, \partial_s^j \partial_t^k \Big( \exp\{\widehat{Z}(s+t) - (s^2+t^2)/2\} \big[ \exp\{(Z - \widehat{Z})(s+t)\} - \exp\{V t^2/2 + (Z - \widehat{Z}) s\} - \exp\{V s^2/2 + (Z - \widehat{Z}) t\} + \exp\{V (s^2+t^2)/2\} \big] \Big) \Big|_{s,t=0}$.

Note that $E[\widehat{Z}] = E[Z] = 0$, because $Z$ is standard normal. Moreover, due to orthogonality, $Z = (Z - \widehat{Z}) + \widehat{Z}$ with the two summands orthogonal, and hence $1 = E[Z^2] = V + E[\widehat{Z}^2]$. Using these facts and again using the independence property, we take the expectation of $\varepsilon^2$ and obtain

$E[\varepsilon^2] = \sum_{j,k=1}^{\infty} \frac{J_j J_k}{\sqrt{j!}\sqrt{k!}}\, \partial_s^j \partial_t^k \Big( \exp\{E[\widehat{Z}](s+t) + E[\widehat{Z}^2](s+t)^2/2 - (s^2+t^2)/2\} \big( \exp\{V(s+t)^2/2\} - \exp\{V(s^2+t^2)/2\} \big) \Big) \Big|_{s,t=0} = \sum_{j,k=1}^{\infty} \frac{J_j J_k}{\sqrt{j!}\sqrt{k!}}\, \partial_s^j \partial_t^k \Big( \exp\{(1-V)(s+t)^2/2 - (s^2+t^2)/2\} \big( \exp\{V(s+t)^2/2\} - \exp\{V(s^2+t^2)/2\} \big) \Big) \Big|_{s,t=0} = \sum_{j,k=1}^{\infty} \frac{J_j J_k}{\sqrt{j!}\sqrt{k!}}\, \partial_s^j \partial_t^k \Big( \exp\{st\} \big( 1 - \exp\{-V st\} \big) \Big) \Big|_{s,t=0}$.

Now, it is straightforward to show that for any constant $A$,

$\partial_s^j \partial_t^k \exp\{A s t\} \big|_{s,t=0} = \begin{cases} A^k\, k! & \text{if } j = k, \\ 0 & \text{else.} \end{cases}$

Since $\exp\{st\}(1 - \exp\{-Vst\}) = \exp\{st\} - \exp\{(1-V)st\}$, applying this with $A = 1$ and $A = 1 - V$ yields

$E[\varepsilon^2] = \sum_{k=1}^{\infty} \frac{J_k^2}{k!} \big( 1 - (1-V)^k \big)\, k! = \sum_{k=1}^{\infty} J_k^2 \big( 1 - (1-V)^k \big)$.
