
A mixture autoregressive model based on Student’s t–distribution

Pages 499-515 | Received 27 Aug 2020, Accepted 07 Apr 2021, Published online: 26 Apr 2021

Abstract

A new mixture autoregressive model based on Student’s t–distribution is proposed. A key feature of our model is that the conditional t–distributions of the component models are based on autoregressions that have multivariate t–distributions as their (low-dimensional) stationary distributions. That autoregressions with such stationary distributions exist is not immediate. Our formulation implies that the conditional mean of each component model is a linear function of past observations and the conditional variance is also time-varying. Compared to previous mixture autoregressive models our model may therefore be useful in applications where the data exhibits rather strong conditional heteroskedasticity. Our formulation also has the theoretical advantage that conditions for stationarity and ergodicity are always met and these properties are much more straightforward to establish than is common in nonlinear autoregressive models. An empirical example employing a realized kernel series constructed from S&P 500 high-frequency intraday data shows that the proposed model performs well in volatility forecasting. Our methodology is implemented in the freely available StMAR Toolbox for MATLAB.

1. Introduction

Different types of mixture models are in widespread use in various fields. Overviews of mixture models can be found, for example, in the monographs of McLachlan and Peel (2000) and Frühwirth-Schnatter (2006). In this paper, we are concerned with mixture autoregressive models that were introduced by Le, Martin, and Raftery (1996) and further developed by Wong and Li (2000, 2001a, 2001b) (for further references, see Kalliovirta, Meitz, and Saikkonen (2015)).

In mixture autoregressive models the conditional distribution of the present observation given the past is a mixture distribution where the component distributions are obtained from linear autoregressive models. The specification of a mixture autoregressive model typically requires two choices: choosing a conditional distribution for the component models and choosing a functional form for the mixing weights. In a majority of existing models a Gaussian distribution is assumed whereas, in addition to constants, several different time-varying mixing weights (functions of past observations) have been considered in the literature.

Instead of a Gaussian distribution, Wong, Chan, and Kam (2009) proposed using Student’s t–distribution. A major motivation for this comes from the heavier tails of the t–distribution which allow the resulting model to better accommodate for the fat tails encountered in many observed time series, especially in economics and finance. In the model suggested by Wong, Chan, and Kam (2009), the conditional mean and conditional variance of each component model are the same as in the Gaussian case (a linear function of past observations and a constant, respectively), and what changes is the distribution of the independent and identically distributed error term: instead of a standard normal distribution, a Student’s t–distribution is used. This is a natural approach to formulate the component models and hence also a mixture autoregressive model based on the t–distribution.

In this paper, we also consider a mixture autoregressive model based on Student’s t–distribution, but our specification differs from that used by Wong, Chan, and Kam (2009). Our starting point is the characteristic feature of linear Gaussian autoregressions that stationary distributions (of consecutive observations) as well as conditional distributions are Gaussian. We imitate this feature by using a (multivariate) Student’s t–distribution and, as a first step, construct a linear autoregression in which both conditional and (low-dimensional) stationary distributions have Student’s t–distributions. This leads to a model where the conditional mean is as in the Gaussian case (a linear function of past observations) whereas the conditional variance is no longer constant but depends on a quadratic form of past observations. These linear models are then used as component models in our new mixture autoregressive model which we call the StMAR model.

Our StMAR model has some very attractive features. Like the model of Wong, Chan, and Kam (2009), it can be useful for modeling time series with leptokurtosis, regime switching, multimodality, persistence, and conditional heteroskedasticity. As the conditional variances of the component models are time-varying, the StMAR model can potentially accommodate for stronger forms of conditional heteroskedasticity than the model of Wong, Chan, and Kam (2009). Our formulation also has the theoretical advantage that, for a pth order model, the stationary distribution of p + 1 consecutive observations is fully known and is a mixture of particular Student’s t–distributions. Moreover, stationarity and ergodicity are simple consequences of the definition of the model and do not require complicated proofs.

Finally, a few notational conventions. All vectors are treated as column vectors and we write $x=(x_1,\ldots,x_n)$ for the vector $x$ where the components $x_i$ may be either scalars or vectors. The notation $X \sim n_d(\mu,\Gamma)$ signifies that the random vector $X$ has a $d$–dimensional Gaussian distribution with mean $\mu$ and (positive definite) covariance matrix $\Gamma$. Similarly, by $X \sim t_d(\mu,\Gamma,\nu)$ we mean that $X$ has a $d$–dimensional Student’s t–distribution with mean $\mu$, (positive definite) covariance matrix $\Gamma$, and degrees of freedom $\nu$ (assumed to satisfy $\nu>2$); the density function and some properties of the multivariate Student’s t–distribution employed are given in Appendix A. The notation $0_d$ ($1_d$) is used for a $d$–dimensional vector of zeros (ones), $\imath_d$ signifies the vector $(1,0,\ldots,0)$ of dimension $d$, and the identity matrix of dimension $d$ is denoted by $I_d$. The Kronecker product is denoted by $\otimes$, and $\mathrm{vec}(A)$ stacks the columns of matrix $A$ on top of one another.

2. Linear Student’s t autoregressions

In order to formulate our new mixture model, this section briefly considers linear pth order autoregressions that have multivariate Student’s t–distributions as their stationary distributions. First, for motivation and to develop notation, consider a linear Gaussian autoregression $z_t$ ($t=1,2,\ldots$) generated by
$$z_t = \varphi_0 + \sum_{i=1}^{p} \varphi_i z_{t-i} + \sigma e_t, \qquad (1)$$
where the error terms $e_t$ are independent and identically distributed with a standard normal distribution, and the parameters satisfy $\varphi_0 \in \mathbb{R}$, $\boldsymbol\varphi = (\varphi_1,\ldots,\varphi_p) \in \mathbb{S}_p$, and $\sigma>0$, where
$$\mathbb{S}_p = \Big\{(\varphi_1,\ldots,\varphi_p) \in \mathbb{R}^p : \varphi(z) = 1 - \sum_{i=1}^{p}\varphi_i z^i \neq 0 \text{ for } |z| \le 1\Big\} \qquad (2)$$
is the stationarity region of a linear pth order autoregression. Denoting $\mathbf{z}_t=(z_t,\ldots,z_{t-p+1})$ and $\mathbf{z}_t^{+}=(z_t,\mathbf{z}_{t-1})$, it is well known that the stationary solution $z_t$ to (1) satisfies
$$\mathbf{z}_t \sim n_p(\mu 1_p,\Gamma_p), \quad \mathbf{z}_t^{+} \sim n_{p+1}(\mu 1_{p+1},\Gamma_{p+1}), \quad z_t \mid \mathbf{z}_{t-1} \sim n_1(\varphi_0 + \boldsymbol\varphi'\mathbf{z}_{t-1},\,\sigma^2) = n_1\big(\mu + \gamma_p'\Gamma_p^{-1}(\mathbf{z}_{t-1}-\mu 1_p),\,\sigma^2\big), \qquad (3)$$
where the last relation defines the conditional distribution of $z_t$ given $\mathbf{z}_{t-1}$ and the quantities $\Gamma_p$, $\gamma_0$, $\gamma_p$, $\mu$, and $\Gamma_{p+1}$ are defined via
$$\mathrm{vec}(\Gamma_p) = (I_{p^2} - \Phi\otimes\Phi)^{-1}\imath_{p^2}\sigma^2, \quad \Phi = \begin{bmatrix}\varphi_1 & \cdots & \varphi_{p-1} & \varphi_p \\ & I_{p-1} & & 0_{p-1}\end{bmatrix}, \quad \gamma_0 = \sigma^2 + \boldsymbol\varphi'\Gamma_p\boldsymbol\varphi, \quad \gamma_p = \Gamma_p\boldsymbol\varphi, \quad \mu = \varphi_0/(1-\varphi_1-\cdots-\varphi_p), \quad \Gamma_{p+1} = \begin{bmatrix}\gamma_0 & \gamma_p' \\ \gamma_p & \Gamma_p\end{bmatrix}. \qquad (4)$$
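To make the quantities in (4) concrete, the following minimal sketch computes $\mu$, $\Gamma_p$, $\gamma_p$, $\gamma_0$, and $\Gamma_{p+1}$ from given values of $\varphi_0$, $\boldsymbol\varphi$, and $\sigma^2$ using the companion matrix $\Phi$ and the vec equation. This is not part of the paper or the StMAR Toolbox; the function name and argument layout are our own.

```python
# Hypothetical helper: stationary moments of a linear AR(p), cf. eq. (4).
import numpy as np

def ar_stationary_moments(phi0, phi, sigma2):
    phi = np.asarray(phi, dtype=float)
    p = phi.size
    # companion matrix Phi of the AR(p) process
    Phi = np.zeros((p, p))
    Phi[0, :] = phi
    if p > 1:
        Phi[1:, :-1] = np.eye(p - 1)
    # vec(Gamma_p) = (I_{p^2} - Phi kron Phi)^{-1} * i_{p^2} * sigma^2
    rhs = np.zeros(p * p)
    rhs[0] = sigma2
    vec_Gamma = np.linalg.solve(np.eye(p * p) - np.kron(Phi, Phi), rhs)
    Gamma_p = vec_Gamma.reshape(p, p, order="F")
    gamma_p = Gamma_p @ phi                    # gamma_p = Gamma_p phi
    gamma0 = sigma2 + phi @ Gamma_p @ phi      # gamma_0 = sigma^2 + phi' Gamma_p phi
    mu = phi0 / (1.0 - phi.sum())              # stationary mean
    Gamma_p1 = np.block([[np.array([[gamma0]]), gamma_p[None, :]],
                         [gamma_p[:, None], Gamma_p]])
    return mu, Gamma_p, gamma_p, gamma0, Gamma_p1
```

For example, `ar_stationary_moments(0.1, [0.5, 0.3], 1.0)` returns the stationary mean and (auto)covariance matrices of an AR(2) with the stated parameters.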

Two essential properties of linear Gaussian autoregressions are that they have the distributional features in (3) and the representation in (1).

It is not immediately obvious that linear autoregressions based on Student’s t–distribution with similar properties exist (such models have, however, appeared at least in Spanos (1994), Heracleous and Spanos (2006), and Pitt and Walker (2006)). Suppose that for a random vector in $\mathbb{R}^{p+1}$ it holds that $(z,\mathbf{z}) \sim t_{p+1}(\mu 1_{p+1},\Gamma_{p+1},\nu)$ where $\nu>2$ (and other notation is as above in (4)). Then (for details, see Appendix A) the conditional distribution of $z$ given $\mathbf{z}$ is $z \mid \mathbf{z} \sim t_1(\mu(\mathbf{z}),\sigma^2(\mathbf{z}),\nu+p)$, where
$$\mu(\mathbf{z}) = \varphi_0 + \boldsymbol\varphi'\mathbf{z}, \qquad \sigma^2(\mathbf{z}) = \frac{\nu-2+(\mathbf{z}-\mu 1_p)'\Gamma_p^{-1}(\mathbf{z}-\mu 1_p)}{\nu-2+p}\,\sigma^2. \qquad (5)$$

We now state the following theorem (proofs of all theorems are in Appendix B).

Theorem 1.

Suppose $\varphi_0 \in \mathbb{R}$, $\boldsymbol\varphi=(\varphi_1,\ldots,\varphi_p)\in\mathbb{S}_p$, $\sigma>0$, and $\nu>2$. Then there exists a process $\mathbf{z}_t=(z_t,\ldots,z_{t-p+1})$ ($t=0,1,2,\ldots$) with the following properties.

  (i) The process $\mathbf{z}_t$ ($t=1,2,\ldots$) is a Markov chain on $\mathbb{R}^p$ with a stationary distribution characterized by the density function $t_p(\mu 1_p,\Gamma_p,\nu)$. When $\mathbf{z}_0 \sim t_p(\mu 1_p,\Gamma_p,\nu)$, we have, for $t=1,2,\ldots$, that $\mathbf{z}_t^{+} \sim t_{p+1}(\mu 1_{p+1},\Gamma_{p+1},\nu)$ and the conditional distribution of $z_t$ given $\mathbf{z}_{t-1}$ is
$$z_t \mid \mathbf{z}_{t-1} \sim t_1\big(\mu(\mathbf{z}_{t-1}),\,\sigma^2(\mathbf{z}_{t-1}),\,\nu+p\big). \qquad (6)$$

  (ii) Furthermore, for $t=1,2,\ldots$, the process $z_t$ has the representation
$$z_t = \varphi_0 + \sum_{i=1}^{p}\varphi_i z_{t-i} + \sigma_t\varepsilon_t \qquad (7)$$
with conditional variance $\sigma_t^2 = \sigma^2(\mathbf{z}_{t-1})$ (see (5)), where the error terms $\varepsilon_t$ form a sequence of independent and identically distributed random variables with a marginal $t_1(0,1,\nu+p)$ distribution and with $\varepsilon_t$ independent of $\{z_s, s<t\}$.

Results (i) and (ii) in Theorem 1 are comparable to properties (3) and (1) in the Gaussian case. Part (i) shows that both the stationary and conditional distributions of $z_t$ are t–distributions, whereas part (ii) clarifies the connection to standard AR(p) models. In contrast to linear Gaussian autoregressions, in this t–distributed case $z_t$ is conditionally heteroskedastic and has an ‘AR(p)–ARCH(p)’ representation (here ARCH refers to autoregressive conditional heteroskedasticity).
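As an illustration of the ‘AR(p)–ARCH(p)’ representation (7), the following hedged sketch simulates such a process: the shock $\varepsilon_t$ is a unit-variance Student’s t variable with $\nu+p$ degrees of freedom and $\sigma_t^2$ is updated through (5). It reuses the hypothetical `ar_stationary_moments` helper sketched above and is not taken from the StMAR Toolbox.

```python
# Simulate the linear Student's t autoregression of Theorem 1 via (5) and (7).
from scipy import stats

def simulate_t_ar(phi0, phi, sigma2, nu, T, rng=None):
    rng = np.random.default_rng(rng)
    phi = np.asarray(phi, float)
    p = phi.size
    mu, Gamma_p, *_ = ar_stationary_moments(phi0, phi, sigma2)
    Gp_inv = np.linalg.inv(Gamma_p)
    z = np.full(p, mu)                  # z_{t-1} = (z_{t-1}, ..., z_{t-p}); a burn-in
    out = np.empty(T)                   # would reduce initialization effects
    df = nu + p
    for t in range(T):
        dev = z - mu
        sig2_t = (nu - 2.0 + dev @ Gp_inv @ dev) / (nu - 2.0 + p) * sigma2   # eq. (5)
        eps = stats.t.rvs(df, random_state=rng) * np.sqrt((df - 2.0) / df)   # t_1(0,1,nu+p)
        z_new = phi0 + phi @ z + np.sqrt(sig2_t) * eps                        # eq. (7)
        out[t] = z_new
        z = np.concatenate(([z_new], z[:-1]))
    return out
```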

3. A mixture autoregressive model based on Student’s t–distribution

3.1. Mixture autoregressive models

Let $y_t$ ($t=1,2,\ldots$) be the real-valued time series of interest, and let $\mathcal{F}_{t-1}$ denote the $\sigma$–algebra generated by $\{y_{t-j},\, j>0\}$. We consider mixture autoregressive models for which the conditional density function of $y_t$ given its past, $f(\cdot\mid\mathcal{F}_{t-1})$, is of the form
$$f(y_t\mid\mathcal{F}_{t-1}) = \sum_{m=1}^{M}\alpha_{m,t}\, f_m(y_t\mid\mathcal{F}_{t-1}), \qquad (8)$$
where the (positive) mixing weights $\alpha_{m,t}$ are $\mathcal{F}_{t-1}$–measurable and satisfy $\sum_{m=1}^{M}\alpha_{m,t}=1$ (for all $t$), and the $f_m(\cdot\mid\mathcal{F}_{t-1})$, $m=1,\ldots,M$, describe the conditional densities of $M$ autoregressive component models. Different mixture models are obtained with different specifications of the mixing weights $\alpha_{m,t}$ and the conditional densities $f_m(\cdot\mid\mathcal{F}_{t-1})$.

Starting with the specification of the conditional densities $f_m(\cdot\mid\mathcal{F}_{t-1})$, a common choice has been to assume the component models to be linear Gaussian autoregressions. For the $m$th component model ($m=1,\ldots,M$), denote the parameters of a pth order linear autoregression with $\varphi_{m,0}\in\mathbb{R}$, $\boldsymbol\varphi_m=(\varphi_{m,1},\ldots,\varphi_{m,p})\in\mathbb{S}_p$, and $\sigma_m>0$. Also set $\mathbf{y}_{t-1}=(y_{t-1},\ldots,y_{t-p})$. In the Gaussian case, the conditional densities in (8) take the form ($m=1,\ldots,M$)
$$f_m(y_t\mid\mathcal{F}_{t-1}) = \frac{1}{\sigma_m}\,\phi\Big(\frac{y_t-\mu_{m,t}}{\sigma_m}\Big),$$
where $\phi(\cdot)$ signifies the density function of a standard normal random variable, $\mu_{m,t}=\varphi_{m,0}+\boldsymbol\varphi_m'\mathbf{y}_{t-1}$ is the conditional mean function (of component $m$), and $\sigma_m^2>0$ is the conditional variance (of component $m$), often assumed to be constant. Instead of a Gaussian density, Wong, Chan, and Kam (2009) considered the case where $f_m(\cdot\mid\mathcal{F}_{t-1})$ is the density of Student’s t–distribution with conditional mean and variance as above, $\mu_{m,t}=\varphi_{m,0}+\boldsymbol\varphi_m'\mathbf{y}_{t-1}$ and a constant $\sigma_m^2$, respectively.

In this paper, we also consider a mixture autoregressive model based on Student’s t–distribution, but our formulation differs from that used by Wong, Chan, and Kam (2009). In Theorem 1 it was seen that linear autoregressions based on Student’s t–distribution naturally lead to the conditional distribution $t_1(\mu(\cdot),\sigma^2(\cdot),\nu+p)$ in (6). Motivated by this, we consider a mixture autoregressive model in which the conditional densities $f_m(y_t\mid\mathcal{F}_{t-1})$ in (8) are specified as
$$f_m(y_t\mid\mathcal{F}_{t-1}) = t_1(y_t;\,\mu_{m,t},\,\sigma_{m,t}^2,\,\nu_m+p), \qquad (9)$$
where the expressions for $\mu_{m,t}=\mu_m(\mathbf{y}_{t-1})$ and $\sigma_{m,t}^2=\sigma_m^2(\mathbf{y}_{t-1})$ are as in (5) except that $\mathbf{z}$ is replaced with $\mathbf{y}_{t-1}$ and all the quantities therein are defined using the regime specific parameters $\varphi_{m,0}$, $\boldsymbol\varphi_m$, $\sigma_m$, and $\nu_m$ (whenever appropriate a subscript $m$ is added to previously defined notation, e.g., $\mu_m$ or $\Gamma_{m,p}$). A key difference to the model of Wong, Chan, and Kam (2009) is that the conditional variance of component $m$ is not constant but a function of $\mathbf{y}_{t-1}$. An explicit expression for the density in (9) can be obtained from Appendix A and is
$$f_m(y_t\mid\mathcal{F}_{t-1}) = C(\nu_m)\,\sigma_{m,t}^{-1}\Big(1+(\nu_m+p-2)^{-1}\Big(\frac{y_t-\mu_{m,t}}{\sigma_{m,t}}\Big)^2\Big)^{-\frac{1+\nu_m+p}{2}}, \qquad (10)$$
where $C(\nu) = \dfrac{\Gamma\big((1+\nu+p)/2\big)}{\sqrt{\pi(\nu+p-2)}\,\Gamma\big((\nu+p)/2\big)}$ (and $\Gamma(\cdot)$ signifies the gamma function).
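The following sketch (with our own helper names, not the StMAR Toolbox) evaluates the regime-specific conditional mean $\mu_{m,t}$ and variance $\sigma_{m,t}^2$ from (5) and the component density (9)–(10) in its mean–variance parameterization; the degrees of freedom argument is $\nu_m+p$.

```python
# Hypothetical helpers for the component densities (9)-(10), continuing the sketches above.
from scipy.special import gammaln

def component_mean_var(y_lag, phi0_m, phi_m, sigma2_m, nu_m, mu_m, Gamma_mp_inv):
    """mu_{m,t} and sigma_{m,t}^2 of regime m given y_{t-1} = (y_{t-1},...,y_{t-p}), cf. (5)."""
    p = len(y_lag)
    dev = y_lag - mu_m
    mu_mt = phi0_m + phi_m @ y_lag
    sig2_mt = (nu_m - 2.0 + dev @ Gamma_mp_inv @ dev) / (nu_m - 2.0 + p) * sigma2_m
    return mu_mt, sig2_mt

def t1_density(y, mu, sig2, df):
    """Density t_1(y; mu, sig2, df) parameterized by mean mu and variance sig2, cf. (10);
    for regime m the caller passes df = nu_m + p."""
    logC = gammaln((1.0 + df) / 2.0) - 0.5 * np.log(np.pi * (df - 2.0)) - gammaln(df / 2.0)
    z2 = (y - mu) ** 2 / sig2
    return np.exp(logC) / np.sqrt(sig2) * (1.0 + z2 / (df - 2.0)) ** (-(1.0 + df) / 2.0)
```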

Now consider the choice of the mixing weights $\alpha_{m,t}$ in (8). The most basic choice is to use constant mixing weights as in Wong and Li (2000) and Wong, Chan, and Kam (2009). Several different time-varying mixing weights have also been suggested, see, e.g., Wong and Li (2001a), Glasbey (2001), Lanne and Saikkonen (2003), Dueker, Sola, and Spagnolo (2007), and Kalliovirta, Meitz, and Saikkonen (2015, 2016).

In this paper, we propose mixing weights that are similar to those used by Glasbey (2001) and Kalliovirta, Meitz, and Saikkonen (2015). Specifically, we set
$$\alpha_{m,t} = \frac{\alpha_m\, t_p(\mathbf{y}_{t-1};\,\mu_m 1_p,\,\Gamma_{m,p},\,\nu_m)}{\sum_{n=1}^{M}\alpha_n\, t_p(\mathbf{y}_{t-1};\,\mu_n 1_p,\,\Gamma_{n,p},\,\nu_n)}, \qquad (11)$$
where the $\alpha_m\in(0,1)$, $m=1,\ldots,M$, are unknown parameters satisfying $\sum_{m=1}^{M}\alpha_m=1$. Note that the Student’s t density appearing in (11) corresponds to the stationary distribution in Theorem 1(i): if the $y_t$’s were generated by a linear Student’s t autoregression described in Section 2 (with a subscript $m$ added to all the notation therein), the stationary distribution of $\mathbf{y}_{t-1}$ would be characterized by $t_p(\mathbf{y}_{t-1};\,\mu_m 1_p,\,\Gamma_{m,p},\,\nu_m)$. Our definition of the mixing weights in (11) is different from that used in Glasbey (2001) and Kalliovirta, Meitz, and Saikkonen (2015) in that these authors employed the $n_p(\mathbf{y}_{t-1};\,\mu_m 1_p,\,\Gamma_{m,p})$ density (corresponding to the stationary distribution of a linear Gaussian autoregression) instead of the Student’s t density $t_p(\mathbf{y}_{t-1};\,\mu_m 1_p,\,\Gamma_{m,p},\,\nu_m)$ we use.
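A possible implementation of the mixing weights (11) is sketched below. Note that SciPy’s `multivariate_t` is parameterized by a shape (scale) matrix rather than the covariance matrix $\Gamma_{m,p}$ used here, so the conversion $\Sigma=(\nu-2)/\nu\,\Gamma$ is applied; function and argument names are ours.

```python
# Hypothetical helper for the mixing weights (11), continuing the sketches above.
from scipy.stats import multivariate_t

def mixing_weights(y_lag, alphas, mus, Gammas, nus):
    """alpha_{m,t}, m = 1..M, given y_{t-1}; alphas/mus/Gammas/nus hold regime parameters."""
    p = len(y_lag)
    dens = np.array([
        multivariate_t.pdf(y_lag, loc=np.full(p, mu_m),
                           shape=(nu_m - 2.0) / nu_m * Gamma_m,  # covariance -> scale matrix
                           df=nu_m)
        for mu_m, Gamma_m, nu_m in zip(mus, Gammas, nus)
    ])
    w = np.asarray(alphas) * dens
    return w / w.sum()          # normalize so the weights sum to one, as in (11)
```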

3.2. The Student’s t mixture autoregressive model

Equations (8), (9), and (11) define a model we call the Student’s t mixture autoregressive, or StMAR, model. When the autoregressive order $p$ or the number of mixture components $M$ needs to be emphasized we refer to an StMAR($p$, $M$) model. We collect the unknown parameters of an StMAR model in the vector $\boldsymbol\theta=(\boldsymbol\vartheta_1,\ldots,\boldsymbol\vartheta_M,\alpha_1,\ldots,\alpha_{M-1})$ (of dimension $(M(p+4)-1)\times 1$), where $\boldsymbol\vartheta_m=(\varphi_{m,0},\boldsymbol\varphi_m,\sigma_m^2,\nu_m)$ (with $\boldsymbol\varphi_m\in\mathbb{S}_p$, $\sigma_m^2>0$, and $\nu_m>2$) contains the parameters of each component model ($m=1,\ldots,M$) and the $\alpha_m$’s are the parameters appearing in the mixing weights (11); the parameter $\alpha_M$ is not included due to the restriction $\sum_{m=1}^{M}\alpha_m=1$.

The StMAR model can also be presented in an alternative (but equivalent) form. To this end, let $P_{t-1}(\cdot)$ signify the conditional probability of the indicated event given $\mathcal{F}_{t-1}$, and let $\varepsilon_{m,t}$ be a sequence of independent and identically distributed random variables with a $t_1(0,1,\nu_m+p)$ distribution such that $\varepsilon_{m,t}$ is independent of $\{y_{t-j},\, j>0\}$ ($m=1,\ldots,M$). Furthermore, let $\mathbf{s}_t=(s_{1,t},\ldots,s_{M,t})$ be a sequence of (unobserved) $M$–dimensional random vectors such that, conditional on $\mathcal{F}_{t-1}$, $\mathbf{s}_t$ and $\varepsilon_{m,t}$ are independent (for all $m$). The components of $\mathbf{s}_t$ are such that, for each $t$, exactly one of them takes the value one and the others are equal to zero, with conditional probabilities $P_{t-1}(s_{m,t}=1)=\alpha_{m,t}$, $m=1,\ldots,M$. Now $y_t$ can be expressed as
$$y_t = \sum_{m=1}^{M} s_{m,t}\,(\mu_{m,t}+\sigma_{m,t}\varepsilon_{m,t}) = \sum_{m=1}^{M} s_{m,t}\,(\varphi_{m,0}+\boldsymbol\varphi_m'\mathbf{y}_{t-1}+\sigma_{m,t}\varepsilon_{m,t}), \qquad (12)$$
where $\sigma_{m,t}$ is as in (9). This formulation suggests that the mixing weights $\alpha_{m,t}$ can be thought of as (conditional) probabilities that determine which one of the $M$ autoregressive components of the mixture generates the observation $y_t$.
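The representation (12) translates directly into a simulation recipe. The sketch below is illustrative only and builds on the hypothetical helpers introduced above: at each $t$ a regime is drawn with probabilities $\alpha_{m,t}$ from (11), and $y_t$ is then generated from that regime’s autoregression with a $t_1(0,1,\nu_m+p)$ shock.

```python
# Illustrative simulation of an StMAR(p, M) process via representation (12).
def simulate_stmar(phi0s, phis, sigma2s, nus, alphas, T, rng=None):
    rng = np.random.default_rng(rng)
    M, p = len(phi0s), len(phis[0])
    # regime-wise stationary quantities (mu_m, Gamma_{m,p}) via eq. (4)
    regs = [ar_stationary_moments(phi0s[m], phis[m], sigma2s[m]) for m in range(M)]
    mus = [r[0] for r in regs]
    Gammas = [r[1] for r in regs]
    Ginvs = [np.linalg.inv(G) for G in Gammas]
    y = np.full(p, np.dot(alphas, mus))        # crude initialization at the mixture mean
    out = np.empty(T)
    for t in range(T):
        w = mixing_weights(y, alphas, mus, Gammas, nus)
        m = rng.choice(M, p=w)                 # draw the regime indicator s_t
        mu_mt, sig2_mt = component_mean_var(y, phi0s[m], np.asarray(phis[m]),
                                            sigma2s[m], nus[m], mus[m], Ginvs[m])
        df = nus[m] + p
        eps = stats.t.rvs(df, random_state=rng) * np.sqrt((df - 2.0) / df)
        out[t] = mu_mt + np.sqrt(sig2_mt) * eps            # eq. (12)
        y = np.concatenate(([out[t]], y[:-1]))
    return out
```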

It turns out that the StMAR model has some very attractive theoretical properties; the carefully chosen conditional densities in (9) and the mixing weights in (11) are crucial in obtaining these properties. The following theorem shows that there exists a choice of initial values $\mathbf{y}_0$ such that $\mathbf{y}_t$ is a stationary and ergodic Markov chain. Importantly, an explicit expression for the stationary distribution is also provided.

Theorem 2.

Consider the StMAR process $y_t$ generated by (8), (9), and (11) (or (12) and (11)) with the conditions $\boldsymbol\varphi_m\in\mathbb{S}_p$ and $\nu_m>2$ satisfied for all $m=1,\ldots,M$. Then $\mathbf{y}_t=(y_t,\ldots,y_{t-p+1})$ ($t=1,2,\ldots$) is a Markov chain on $\mathbb{R}^p$ with a stationary distribution characterized by the density
$$f(\mathbf{y};\boldsymbol\theta) = \sum_{m=1}^{M}\alpha_m\, t_p(\mathbf{y};\,\mu_m 1_p,\,\Gamma_{m,p},\,\nu_m).$$
Moreover, $\mathbf{y}_t$ is ergodic.

As can be seen from the proof of Theorem 2 (in Appendix B), the Markov property, stationarity, and ergodicity are obtained as reasonably simple consequences of the definition of the StMAR model. The stationary distribution of $\mathbf{y}_t$ is a mixture of $M$ $p$–dimensional t–distributions with constant mixing weights $\alpha_m$. Hence, moments of the stationary distribution of order smaller than $\min(\nu_1,\ldots,\nu_M)$ exist and are finite. Furthermore, the stationary distribution of the vector $(y_t,\mathbf{y}_{t-1})$ is also a mixture of $M$ t–distributions with density of the same form, $\sum_{m=1}^{M}\alpha_m\, t_{p+1}(\mu_m 1_{p+1},\Gamma_{m,p+1},\nu_m)$ (for details, see Appendix B). Thus the mean, variance, and first $p$ autocovariances of $y_t$ are (here the connection between $\gamma_{m,j}$ and $\Gamma_{m,p+1}$ is as in (4))
$$\mu \overset{def}{=} E[y_t] = \sum_{m=1}^{M}\alpha_m\mu_m, \qquad \gamma_j \overset{def}{=} \mathrm{Cov}[y_t,y_{t-j}] = \sum_{m=1}^{M}\alpha_m\gamma_{m,j} + \sum_{m=1}^{M}\alpha_m(\mu_m-\mu)^2, \quad j=0,\ldots,p.$$

Subvectors of $(y_t,\mathbf{y}_{t-1})$ also have stationary distributions that belong to the same family (but this does not hold for higher dimensional vectors such as $(y_{t+1},y_t,\mathbf{y}_{t-1})$).
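The stationary mean and autocovariances above can be computed as in the following short sketch (helper names are ours, building on the sketch after (4)); the $\gamma_{m,j}$ are read off the first row of $\Gamma_{m,p+1}$.

```python
# Illustrative computation of the unconditional moments of an StMAR process.
def stationary_moments_stmar(phi0s, phis, sigma2s, alphas):
    regs = [ar_stationary_moments(phi0s[m], phis[m], sigma2s[m]) for m in range(len(phi0s))]
    mus = np.array([r[0] for r in regs])
    alphas = np.asarray(alphas)
    mu = alphas @ mus                                   # mu = sum_m alpha_m mu_m
    # row m of gam_mj: (gamma_{m,0}, gamma_{m,1}, ..., gamma_{m,p})
    gam_mj = np.array([r[4][0, :] for r in regs])
    gammas = alphas @ gam_mj + alphas @ (mus - mu) ** 2  # gamma_j, j = 0,...,p
    return mu, gammas
```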

The fact that an explicit expression for the stationary (marginal) distribution of the StMAR model is available is not only convenient but also quite exceptional among mixture autoregressive models or other related nonlinear autoregressive models (such as threshold or smooth transition models). Previously, similar results have been obtained by Glasbey (2001) and Kalliovirta, Meitz, and Saikkonen (2015) in the context of mixture autoregressive models that are of the same form but based on the Gaussian distribution (for a few rather simple first order examples involving other models, see Tong (2011, Section 4.2)).

From the definition of the model, the conditional mean and variance of $y_t$ are obtained as
$$E[y_t\mid\mathcal{F}_{t-1}] = \sum_{m=1}^{M}\alpha_{m,t}\mu_{m,t}, \qquad \mathrm{Var}[y_t\mid\mathcal{F}_{t-1}] = \sum_{m=1}^{M}\alpha_{m,t}\sigma_{m,t}^2 + \sum_{m=1}^{M}\alpha_{m,t}\Big(\mu_{m,t}-\sum_{n=1}^{M}\alpha_{n,t}\mu_{n,t}\Big)^2. \qquad (13)$$

Except for the different definition of the mixing weights, the conditional mean is as in the Gaussian mixture autoregressive model of Kalliovirta, Meitz, and Saikkonen (2015). This is due to the well-known fact that in the multivariate t–distribution the conditional mean is of the same linear form as in the multivariate Gaussian distribution. However, unlike in the Gaussian case, the conditional variance of the multivariate t–distribution is not constant. Therefore, in (13) we have the time-varying variance component $\sigma_{m,t}^2$ which in the models of Kalliovirta, Meitz, and Saikkonen (2015) and Wong, Chan, and Kam (2009) is constant (in the latter model the mixing weights are also constants). In (13) both the mixing weights $\alpha_{m,t}$ and the variance components $\sigma_{m,t}^2$ are functions of $\mathbf{y}_{t-1}$, implying that the conditional variance exhibits nonlinear autoregressive conditional heteroskedasticity. Compared to the aforementioned previous models our model may therefore be useful in applications where the data exhibits rather strong conditional heteroskedasticity.
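For reference, the one-step conditional mean and variance (13) can be computed as in the following sketch, which reuses the hypothetical helpers introduced above (all regime inputs are illustrative).

```python
# Illustrative one-step conditional mean and variance of an StMAR process, cf. (13).
def conditional_mean_var(y_lag, phi0s, phis, sigma2s, nus, alphas, mus, Gammas, Ginvs):
    w = mixing_weights(y_lag, alphas, mus, Gammas, nus)            # alpha_{m,t}
    mv = [component_mean_var(y_lag, phi0s[m], np.asarray(phis[m]), sigma2s[m],
                             nus[m], mus[m], Ginvs[m]) for m in range(len(w))]
    mu_mt = np.array([x[0] for x in mv])
    sig2_mt = np.array([x[1] for x in mv])
    cond_mean = w @ mu_mt
    cond_var = w @ sig2_mt + w @ (mu_mt - cond_mean) ** 2          # eq. (13)
    return cond_mean, cond_var
```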

In many applications in economics, finance, and other fields, the data is often multimodal and contains periods with markedly different behaviors. In such a situation a multiple-regime StMAR model would be more appropriate than a linear model; this includes the single-regime StMAR model (M = 1), which corresponds to the linear Student’s t autoregression considered in Section 2. Furthermore, the conditional mean and variance are much more flexible in a mixture model than in a linear one.

4. Estimation

The parameters of an StMAR model can be estimated by the method of maximum likelihood (details of the numerical optimization methods employed and of simulation experiments are available in the Supplementary Appendix). As the stationary distribution of the StMAR process is known, it is even possible to make use of initial values and construct the exact likelihood function and obtain exact maximum likelihood estimates. Assuming the observed data $y_{-p+1},\ldots,y_0,y_1,\ldots,y_T$ and stationary initial values, the log-likelihood function takes the form
$$L_T(\boldsymbol\theta) = \log\Big(\sum_{m=1}^{M}\alpha_m\, t_p(\mathbf{y}_0;\,\mu_m 1_p,\,\Gamma_{m,p},\,\nu_m)\Big) + \sum_{t=1}^{T} l_t(\boldsymbol\theta), \qquad (14)$$
where
$$l_t(\boldsymbol\theta) = \log\Big(\sum_{m=1}^{M}\alpha_{m,t}\, t_1(y_t;\,\mu_{m,t},\,\sigma_{m,t}^2,\,\nu_m+p)\Big). \qquad (15)$$

An explicit expression for the density appearing in (15) is given in (10), and the notation for $\mu_{m,t}$ and $\sigma_{m,t}^2$ is explained after (9). Although not made explicit, $\alpha_{m,t}$, $\mu_{m,t}$, and $\sigma_{m,t}^2$, as well as the quantities $\mu_m$, $\gamma_{m,p}$, and $\Gamma_{m,p}$, depend on the parameter vector $\boldsymbol\theta$.

In (14) it has been assumed that the initial values $\mathbf{y}_0$ are generated by the stationary distribution. If this assumption seems inappropriate, one can condition on initial values and drop the first term on the right hand side of (14). In what follows we assume that estimation is based on this conditional log-likelihood, namely $L_T^{(c)}(\boldsymbol\theta) = T^{-1}\sum_{t=1}^{T} l_t(\boldsymbol\theta)$, which we, for convenience, have also scaled with the sample size. Maximizing $L_T^{(c)}(\boldsymbol\theta)$ with respect to $\boldsymbol\theta$ yields the maximum likelihood estimator, denoted by $\hat{\boldsymbol\theta}_T$.
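A minimal sketch of the scaled conditional log-likelihood $L_T^{(c)}(\boldsymbol\theta)$ is given below; it is illustrative rather than the toolbox's implementation and combines the mixing weights (11), the component densities (10), and (15) using the hypothetical helpers above. In practice such a function would be passed to a numerical optimizer subject to the parameter constraints discussed next.

```python
# Illustrative scaled conditional log-likelihood of an StMAR(p, M) model.
def conditional_loglik(y, phi0s, phis, sigma2s, nus, alphas):
    y = np.asarray(y, float)
    M, p = len(phi0s), len(phis[0])
    regs = [ar_stationary_moments(phi0s[m], phis[m], sigma2s[m]) for m in range(M)]
    mus = [r[0] for r in regs]
    Gammas = [r[1] for r in regs]
    Ginvs = [np.linalg.inv(G) for G in Gammas]
    T = len(y) - p                      # y contains the p initial values first
    ll = 0.0
    for t in range(p, len(y)):
        y_lag = y[t - p:t][::-1]        # (y_{t-1}, ..., y_{t-p})
        w = mixing_weights(y_lag, alphas, mus, Gammas, nus)
        dens = np.array([
            t1_density(y[t], *component_mean_var(y_lag, phi0s[m], np.asarray(phis[m]),
                                                 sigma2s[m], nus[m], mus[m], Ginvs[m]),
                       nus[m] + p)
            for m in range(M)
        ])
        ll += np.log(w @ dens)          # l_t(theta), eq. (15)
    return ll / T                       # scaled by the sample size
```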

The permissible parameter space of $\boldsymbol\theta$, denoted by $\Theta$, needs to be constrained in various ways. The stationarity conditions $\boldsymbol\varphi_m\in\mathbb{S}_p$, the positivity of the variances $\sigma_m^2$, and the conditions $\nu_m>2$ ensuring existence of second moments are all assumed to hold (for $m=1,\ldots,M$). Throughout we assume that the number of mixture components $M$ is known, and this also entails the requirement that the parameters $\alpha_m$ ($m=1,\ldots,M$) are strictly positive (and strictly less than unity whenever $M>1$). Further restrictions are required to ensure identification. Denoting the true parameter value by $\boldsymbol\theta_0$ and assuming stationary initial values, the condition needed is that $l_t(\boldsymbol\theta)=l_t(\boldsymbol\theta_0)$ almost surely only if $\boldsymbol\theta=\boldsymbol\theta_0$. An additional assumption needed for this is
$$\alpha_1 > \cdots > \alpha_M > 0 \quad\text{and}\quad \boldsymbol\vartheta_i = \boldsymbol\vartheta_j \ \text{only if}\ 1 \le i = j \le M. \qquad (16)$$

From a practical point of view this assumption is not restrictive because what it essentially requires is that the M component models cannot be ‘relabeled’ and the same StMAR model obtained. We summarize the restrictions imposed on the parameter space as follows.

Assumption 1.

The true parameter value $\boldsymbol\theta_0$ is an interior point of $\Theta$, where $\Theta$ is a compact subset of $\{\boldsymbol\theta=(\boldsymbol\vartheta_1,\ldots,\boldsymbol\vartheta_M,\alpha_1,\ldots,\alpha_{M-1})\in\mathbb{R}^{M(p+3)}\times(0,1)^{M-1}:\ \boldsymbol\varphi_m\in\mathbb{S}_p,\ \sigma_m^2>0,\ \text{and}\ \nu_m>2\ \text{for all}\ m=1,\ldots,M,\ \text{and (16) holds}\}$.

Asymptotic properties of the maximum likelihood estimator can now be established under conventional high-level conditions. Denote
$$\mathcal{I}(\boldsymbol\theta) = E\Big[\frac{\partial l_t(\boldsymbol\theta)}{\partial\boldsymbol\theta}\,\frac{\partial l_t(\boldsymbol\theta)}{\partial\boldsymbol\theta'}\Big] \quad\text{and}\quad \mathcal{J}(\boldsymbol\theta) = E\Big[-\frac{\partial^2 l_t(\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta'}\Big].$$

Theorem 3.

Suppose $y_t$ is generated by the stationary and ergodic StMAR process of Theorem 2 and that Assumption 1 holds. Then $\hat{\boldsymbol\theta}_T$ is strongly consistent, i.e., $\hat{\boldsymbol\theta}_T \to \boldsymbol\theta_0$ almost surely. Suppose further that (i) $T^{1/2}\,\partial L_T^{(c)}(\boldsymbol\theta_0)/\partial\boldsymbol\theta \overset{d}{\to} N(0,\mathcal{I}(\boldsymbol\theta_0))$ with $\mathcal{I}(\boldsymbol\theta_0)$ finite and positive definite, (ii) $\mathcal{J}(\boldsymbol\theta_0)=\mathcal{I}(\boldsymbol\theta_0)$, and (iii) $E\big[\sup_{\boldsymbol\theta\in\Theta_0}\|\partial^2 l_t(\boldsymbol\theta)/\partial\boldsymbol\theta\,\partial\boldsymbol\theta'\|\big]<\infty$ for some $\Theta_0$, a compact convex set contained in the interior of $\Theta$ that has $\boldsymbol\theta_0$ as an interior point. Then $T^{1/2}(\hat{\boldsymbol\theta}_T-\boldsymbol\theta_0) \overset{d}{\to} N(0,\mathcal{J}(\boldsymbol\theta_0)^{-1})$.

Of the conditions in this theorem, (i) states that a central limit theorem holds for the score vector (evaluated at θ0) and that the information matrix is positive definite, (ii) is the information matrix equality, and (iii) ensures the uniform convergence of the Hessian matrix (in some neighborhood of θ0). These conditions are standard but their verification may be tedious.
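For illustration only: under the conditions of Theorem 3, approximate standard errors can be obtained by estimating $\mathcal{J}(\boldsymbol\theta_0)$ with the negative Hessian of the scaled conditional log-likelihood evaluated at the estimate and using the asymptotic covariance $\mathcal{J}(\boldsymbol\theta_0)^{-1}/T$. The finite-difference sketch below is ours, not the toolbox’s implementation, and assumes a hypothetical wrapper `scaled_loglik` taking a flat parameter vector.

```python
# Illustrative approximate standard errors via a finite-difference Hessian.
import numpy as np

def numerical_hessian(f, x, h=1e-4):
    x = np.asarray(x, float)
    k = x.size
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei = np.zeros(k); ei[i] = h
            ej = np.zeros(k); ej[j] = h
            # central-difference approximation of the (i, j) second derivative
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * h * h)
    return H

def approx_standard_errors(scaled_loglik, theta_hat, T):
    J_hat = -numerical_hessian(scaled_loglik, np.asarray(theta_hat, float))
    cov = np.linalg.inv(J_hat) / T          # asymptotic covariance J^{-1} / T
    return np.sqrt(np.diag(cov))
```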

Theorem 3 shows that the conventional limiting distribution applies to the maximum likelihood estimator $\hat{\boldsymbol\theta}_T$, which implies the applicability of standard likelihood-based tests. It is worth noting, however, that here a correct specification of the number of autoregressive components $M$ is required. In particular, if the number of component models is chosen too large then some parameters of the model are not identified and, consequently, the result of Theorem 3 and the validity of the related tests break down. This particularly happens when one tests for the number of component models. Such tests for mixture autoregressive models with Gaussian conditional densities (see (8)) are developed by Meitz and Saikkonen (2021). The testing problem is highly nonstandard and extending their results to the present case is beyond the scope of this paper.

Instead of formal tests, in our empirical application we take a pragmatic approach and resort to the use of information criteria to infer which model fits the data best. Similar approaches have also been used by Wong, Chan, and Kam (2009) and others. Note that once the number of regimes is (correctly) chosen, standard likelihood-based inference can be used to choose regime-wise autoregressive orders and to test other hypotheses of interest. Validity of (quantile) residual-based misspecification tests to check for model adequacy also relies on the correct specification of the number of regimes.

5. Empirical example

Modeling and forecasting financial market volatility is key to managing risk. In this application we use the realized kernel of Barndorff-Nielsen et al. (2008) as a proxy for latent volatility. We obtained daily realized kernel data over the period 3 January 2000 through 20 May 2016 for the S&P 500 index from the Oxford-Man Institute’s Realized Library v0.2 (Heber et al. 2009). Figure 1 shows the in-sample period (Jan 3, 2000–June 3, 2014; 3597 observations) for the S&P 500 realized kernel data (RKt), which is nonnegative with a distribution exhibiting substantial skewness and excess kurtosis (sample skewness 14.3, sample kurtosis 380.8). We follow the related literature, which frequently uses the logarithmic realized kernel (log(RKt)), to avoid imposing additional parameter constraints and to obtain a more symmetric distribution, often taken to be approximately Gaussian. The log(RKt) data, also shown in Figure 1, has a sample skewness of 0.5 and kurtosis of 3.5. Visual inspection of the time series plots of the RKt and log(RKt) data suggests that the two series exhibit changes at least in levels and potentially also in variability. A kernel estimate of the density function of the log(RKt) series also suggests the potential presence of multiple regimes.

Figure 1. Left panel: Daily RKt (lower solid) and log(RKt) (upper solid), and mixing weights based on the estimates of the StMAR(4,2) model in Table 1 (dot-dash) for the log(RKt) series. The mixing weights α̂1,t are scaled from (0, 1) to (min log(RKt), max log(RKt)). Right panel: A kernel density estimate of the log(RKt) observations (solid), and the mixture density (dashes) implied by the same StMAR model as in the left panel.

For brevity, we focus our attention on StMAR models with p = 1, 2, 3, 4 and M = 1, 2, 3; higher-order models were also tried but their forecasting performance was qualitatively similar to the models with p ≤ 4. Following Wong and Li (2001a), Wong, Chan, and Kam (2009), and Li et al. (2015), we use information criteria for model comparison. Of these models, the Akaike information criterion (AIC) and the Hannan-Quinn information criterion (HQC) favor the StMAR(4,3) model, and the Bayesian information criterion (BIC) the simpler StMAR(4,1) model. Estimation results for these two models, as well as the intermediate StMAR(4,2) model, are reported in Table 1. As the estimated mixture weight of the third component of the StMAR(4,3) model is rather small (α̂3 ≈ 0.023) and the first two components are very similar to those of the StMAR(4,2) model, including this intermediate StMAR(4,2) model seems reasonable. In view of the approximate standard errors in Table 1, the estimation accuracy appears quite reasonable except for the degrees of freedom parameters (for large values of the degrees of freedom parameters the likelihood function becomes very flat; in particular ν̂3 and its standard error may be rather inaccurate). Taking the sum of the autoregressive parameters as a measure of persistence, we find that the estimated persistence for the StMAR(4,2) model is 0.909 in the first regime and 0.489 in the second, suggesting that persistence is rather strong in the first regime and moderate in the second.

Table 1. Parameter estimates for three selected StMAR models and the log(RKt) data over the period 3 January 2000–3 June 2014.

Numerous alternative models for volatility proxies have been proposed. We employ Corsi’s (2009) heterogeneous autoregressive (HAR) model as it is arguably the most popular reference model for forecasting proxies such as the realized kernel. We also consider a pth-order autoregression, as the AR(p) often performs well in volatility proxy forecasting. The StMAR models are estimated using maximum likelihood, and the reference AR and HAR models by ordinary least squares. We use a fixed scheme, where the parameters of our volatility models are estimated just once using data from Jan 3, 2000–June 3, 2014. These estimates are then used to generate all forecasts. The remaining 496 observations of our sample are used to compare the forecasts from the alternative models. As discussed in Kalliovirta, Meitz, and Saikkonen (2016), computing multi-step-ahead forecasts for mixture models like the StMAR is rather complicated. For this reason we use simulation-based forecasts to predict future volatility: for each out-of-sample date T, and for each alternative model, we simulate 1,000,000 sample paths. Each path is of length 22 (representing one trading month) and conditional on the information available at date T. In these simulations unknown parameters are replaced by their estimates. As the simulated paths are for log(RKt), and our object of interest is RKt, an exponential transformation is applied.
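A scaled-down sketch of this simulation scheme for the StMAR model is given below (far fewer than 1,000,000 paths; parameter containers and helper functions are the illustrative ones introduced in earlier sections, not the toolbox's routines): from the last p in-sample values of log(RKt), paths of length 22 are generated one step at a time as in (12) and then exponentiated back to the RKt scale.

```python
# Illustrative simulation-based forecasting for a fitted StMAR model.
def simulate_paths(last_obs, phi0s, phis, sigma2s, nus, alphas,
                   horizon=22, n_paths=10000, rng=None):
    rng = np.random.default_rng(rng)
    M, p = len(phi0s), len(phis[0])
    regs = [ar_stationary_moments(phi0s[m], phis[m], sigma2s[m]) for m in range(M)]
    mus = [r[0] for r in regs]
    Gammas = [r[1] for r in regs]
    Ginvs = [np.linalg.inv(G) for G in Gammas]
    paths = np.empty((n_paths, horizon))
    for i in range(n_paths):
        # last_obs holds (y_{T-p+1}, ..., y_T) in time order; reverse to (y_T, ..., y_{T-p+1})
        y = np.asarray(last_obs, float)[::-1].copy()
        for h in range(horizon):
            w = mixing_weights(y, alphas, mus, Gammas, nus)
            m = rng.choice(M, p=w)
            mu_mt, sig2_mt = component_mean_var(y, phi0s[m], np.asarray(phis[m]),
                                                sigma2s[m], nus[m], mus[m], Ginvs[m])
            df = nus[m] + p
            eps = stats.t.rvs(df, random_state=rng) * np.sqrt((df - 2.0) / df)
            y_new = mu_mt + np.sqrt(sig2_mt) * eps
            paths[i, h] = y_new
            y = np.concatenate(([y_new], y[:-1]))
    return np.exp(paths)                # back to the RK_t scale

# e.g. distribution of the weekly cumulative forecast: simulate_paths(...)[:, :5].sum(axis=1)
```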

We examine daily, weekly (5 day), biweekly (10 day), and monthly (22 day) volatility forecasts generated by the alternative models; for instance, the weekly volatility forecast at date T is the forecast for RK(T+1) + ⋯ + RK(T+5) (the 5-day-ahead cumulative realized kernel). Table 2 reports the percentage shares of (1, 5, 10, and 22-day) cumulative RKt out-of-sample observations that belong to the 99%, 95%, and 90% one-sided upper prediction intervals based on the distribution of the simulated sample paths; these upper prediction intervals for volatility are related to higher levels of risk in financial markets. Overall, it is seen that the empirical coverage rates of the StMAR based prediction intervals are closer to the nominal levels than those obtained with the reference models. By comparison, the accuracy of the prediction intervals obtained with the popular HAR model quickly degrades as the forecast period increases. The StMAR model performs well also when two-sided prediction intervals and point forecast accuracy are considered (for details, see the Supplementary Appendix).

Table 2. The percentage shares of cumulative realized kernel observations that belong to the 99%, 95% and 90% one-sided upper prediction intervals based on the distribution of 1,000,000 simulated conditional sample paths.
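Coverage figures of the kind reported in Table 2 can be computed along the following lines; this is a sketch under the assumption that the simulated cumulative values and the realized cumulative values have already been collected into arrays over the out-of-sample period (array names are ours).

```python
# Illustrative empirical coverage of one-sided upper prediction intervals.
import numpy as np

def upper_interval_coverage(sim_cums, realized_cums, level=0.99):
    """sim_cums: (n_origins, n_paths) simulated h-day cumulative RK values;
    realized_cums: (n_origins,) realized h-day cumulative RK values."""
    bounds = np.quantile(sim_cums, level, axis=1)       # upper bound per forecast origin
    return 100.0 * np.mean(realized_cums <= bounds)     # empirical coverage in percent
```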


Acknowledgments

The authors thank the Academy of Finland for financial support, and the editors and an anonymous referee for useful comments and suggestions.

References

  • Barndorff-Nielsen, O. E., P. R. Hansen, A. Lunde, and N. Shephard. 2008. Designing realized kernels to measure the ex post variation of equity prices in the presence of noise. Econometrica 76:1481–536.
  • Corsi, F. 2009. A simple approximate long-memory model of realized volatility. Journal of Financial Econometrics 7 (2):174–96. doi:10.1093/jjfinec/nbp001.
  • Ding, P. 2016. On the conditional distribution of the multivariate t distribution. The American Statistician 70 (3):293–95. doi:10.1080/00031305.2016.1164756.
  • Dueker, M. J., M. Sola, and F. Spagnolo. 2007. Contemporaneous threshold autoregressive models: Estimation, testing and forecasting. Journal of Econometrics 141 (2):517–47. doi:10.1016/j.jeconom.2006.10.022.
  • Frühwirth-Schnatter, S. 2006. Finite mixture and Markov switching models. New York: Springer.
  • Glasbey, C. A. 2001. Non-linear autoregressive time series with multivariate Gaussian mixtures as marginal distributions. Journal of the Royal Statistical Society: Series C (Applied Statistics) 50 (2):143–54. doi:10.1111/1467-9876.00225.
  • Heber, G., A. Lunde, N. Shephard, and K. Sheppard. 2009. Oxford-Man Institute’s Realized Library v0.2. Oxford: Oxford-Man Institute, University of Oxford.
  • Heracleous, M. S., and A. Spanos. 2006. The Student’s t dynamic linear regression: Re-examining volatility modeling. In Econometric analysis of financial and economic time series (Advances in Econometrics, Vol. 20 Part 1), ed. D. Terrell and T. B. Fomby, 289–319. Oxford: Emerald Group Publishing Limited.
  • Holzmann, H., A. Munk, and T. Gneiting. 2006. Identifiability of finite mixtures of elliptical distributions. Scandinavian Journal of Statistics 33 (4):753–63. doi:10.1111/j.1467-9469.2006.00505.x.
  • Kalliovirta, L., M. Meitz, and P. Saikkonen. 2015. A Gaussian mixture autoregressive model for univariate time series. Journal of Time Series Analysis 36 (2):247–66. doi:10.1111/jtsa.12108.
  • Kalliovirta, L., M. Meitz, and P. Saikkonen. 2016. Gaussian mixture vector autoregression. Journal of Econometrics 192 (2):485–98. doi:10.1016/j.jeconom.2016.02.012.
  • Kotz, S., and S. Nadarajah. 2004. Multivariate t distributions and their applications. Cambridge: Cambridge University Press.
  • Lanne, M., and P. Saikkonen. 2003. Modeling the US short-term interest rate by mixture autoregressive processes. Journal of Financial Econometrics 1 (1):96–125. doi:10.1093/jjfinec/nbg004.
  • Le, N. D., R. D. Martin, and A. E. Raftery. 1996. Modeling flat stretches, bursts, and outliers in time series using mixture transition distribution models. Journal of the American Statistical Association 91:1504–15.
  • Li, G., B. Guan, W. K. Li, and P. L. Yu. 2015. Hysteretic autoregressive time series models. Biometrika 102 (3):717–23. doi:10.1093/biomet/asv017.
  • McLachlan, G., and D. Peel. 2000. Finite mixture models. New York: Wiley.
  • Meitz, M., and P. Saikkonen. 2021. Testing for observation-dependent regime switching in mixture autoregressive models. Journal of Econometrics 222 (1):601–24. doi:10.1016/j.jeconom.2020.04.048.
  • Meyn, S., and R. L. Tweedie. 2009. Markov chains and stochastic stability. 2nd ed. Cambridge: Cambridge University Press.
  • Pitt, M. K., and S. G. Walker. 2006. Extended constructions of stationary autoregressive processes. Statistics & Probability Letters 76 (12):1219–24. doi:10.1016/j.spl.2005.12.020.
  • Ranga Rao, R. 1962. Relations between weak and uniform convergence of measures with applications. The Annals of Mathematical Statistics 33 (2):659–80. doi:10.1214/aoms/1177704588.
  • Spanos, A. 1994. On modeling heteroskedasticity: The Student’s t and elliptical linear regression models. Econometric Theory 10 (2):286–315. doi:10.1017/S0266466600008422.
  • Tong, H. 2011. Threshold models in time series analysis – 30 years on. Statistics and Its Interface 4 (2):107–18. doi:10.4310/SII.2011.v4.n2.a1.
  • Wong, C. S., W. S. Chan, and P. L. Kam. 2009. A Student t-mixture autoregressive model with applications to heavy-tailed financial data. Biometrika 96 (3):751–60. doi:10.1093/biomet/asp031.
  • Wong, C. S., and W. K. Li. 2000. On a mixture autoregressive model. Journal of the Royal Statistical Society: Series B 62:95–115.
  • Wong, C. S., and W. K. Li. 2001a. On a logistic mixture autoregressive model. Biometrika 88 (3):833–46. doi:10.1093/biomet/88.3.833.
  • Wong, C. S., and W. K. Li. 2001b. On a mixture autoregressive conditional heteroscedastic model. Journal of the American Statistical Association 96 (455):982–95. doi:10.1198/016214501753208645.

Appendices

Appendix A: Properties of the multivariate Student’s t–distribution

The standard form of the density function of the multivariate Student’s t–distribution with $\nu$ degrees of freedom and dimension $d$ is (see, e.g., Kotz and Nadarajah (2004, p. 1))
$$f(x) = \frac{\Gamma((d+\nu)/2)}{(\pi\nu)^{d/2}\,\Gamma(\nu/2)\,\det(\Sigma)^{1/2}}\big(1+\nu^{-1}(x-\mu)'\Sigma^{-1}(x-\mu)\big)^{-\frac{d+\nu}{2}},$$
where $\Gamma(\cdot)$ is the gamma function and $\mu\in\mathbb{R}^d$ and $\Sigma$ ($d\times d$), a symmetric positive definite matrix, are parameters. For a random vector $X$ possessing this density, the mean and covariance are $E[X]=\mu$ and $\mathrm{Cov}[X]=\Gamma=\frac{\nu}{\nu-2}\Sigma$ (assuming $\nu>2$). The density can be expressed in terms of $\mu$ and $\Gamma$ as
$$f(x) = \frac{\Gamma((d+\nu)/2)}{(\pi(\nu-2))^{d/2}\,\Gamma(\nu/2)\,\det(\Gamma)^{1/2}}\big(1+(\nu-2)^{-1}(x-\mu)'\Gamma^{-1}(x-\mu)\big)^{-\frac{d+\nu}{2}}.$$

This form of the density function, denoted by $t_d(x;\mu,\Gamma,\nu)$, is used in this paper, and the notation $X\sim t_d(\mu,\Gamma,\nu)$ is used for a random vector $X$ possessing this density. The condition $\nu>2$ and positive definiteness of $\Gamma$ will be tacitly assumed.

For marginal and conditional distributions, partition $X$ as $X=(X_1,X_2)$ where the components have dimensions $d_1$ and $d_2$ ($d_1+d_2=d$). Conformably partition $\mu$ and $\Gamma$ as $\mu=(\mu_1,\mu_2)$ and
$$\Gamma = \begin{bmatrix}\Gamma_{11} & \Gamma_{12} \\ \Gamma_{12}' & \Gamma_{22}\end{bmatrix}.$$

Then the marginal distributions of $X_1$ and $X_2$ are $t_{d_1}(\mu_1,\Gamma_{11},\nu)$ and $t_{d_2}(\mu_2,\Gamma_{22},\nu)$, respectively. The conditional distribution of $X_1$ given $X_2$ is also a t–distribution, namely (see Ding (2016, Sec. 2))
$$X_1 \mid (X_2 = x_2) \sim t_{d_1}\big(\mu_{1|2}(x_2),\,\Gamma_{1|2}(x_2),\,\nu+d_2\big),$$
where
$$\mu_{1|2}(x_2) = \mu_1 + \Gamma_{12}\Gamma_{22}^{-1}(x_2-\mu_2) \quad\text{and}\quad \Gamma_{1|2}(x_2) = \frac{\nu-2+(x_2-\mu_2)'\Gamma_{22}^{-1}(x_2-\mu_2)}{\nu-2+d_2}\big(\Gamma_{11}-\Gamma_{12}\Gamma_{22}^{-1}\Gamma_{12}'\big).$$
Furthermore,
$$t_d(x;\mu,\Gamma,\nu) = t_{d_1}\big(x_1;\,\mu_{1|2}(x_2),\,\Gamma_{1|2}(x_2),\,\nu+d_2\big)\, t_{d_2}(x_2;\,\mu_2,\,\Gamma_{22},\,\nu).$$

Now consider a special case: a $(p+1)$–dimensional random vector $X\sim t_{p+1}(\mu 1_{p+1},\Gamma_{p+1},\nu)$, where $\mu\in\mathbb{R}$ and $\Gamma_{p+1}$ is a symmetric positive definite Toeplitz matrix. Note that the mean vector $\mu 1_{p+1}$ and the covariance matrix $\Gamma_{p+1}$ have structures similar to those of the mean and covariance matrix of a $(p+1)$–dimensional realization of a second order stationary process. More specifically, assume that $\Gamma_{p+1}$ is the covariance matrix of a second order stationary AR(p) process.

Partition $X$ as $X=(X_1,\mathbf{X}_2)=(\mathbf{X}_1,X_{p+1})$ with $X_1$ and $X_{p+1}$ real valued and $\mathbf{X}_1$ and $\mathbf{X}_2$ both $p\times 1$ vectors. The marginal distributions of $\mathbf{X}_1$ and $\mathbf{X}_2$ are $\mathbf{X}_1\sim t_p(\mu 1_p,\Gamma_p,\nu)$ and $\mathbf{X}_2\sim t_p(\mu 1_p,\Gamma_p,\nu)$, where the (symmetric positive definite Toeplitz) matrix $\Gamma_p=\mathrm{Cov}[\mathbf{X}_1]=\mathrm{Cov}[\mathbf{X}_2]$ is obtained from $\Gamma_{p+1}$ by deleting the first row and first column or, equivalently, the last row and last column (here the specific structures of $\mu 1_{p+1}$ and $\Gamma_{p+1}$ are used). The conditional distribution of $X_1$ given $\mathbf{X}_2=x_2$ is $X_1\mid(\mathbf{X}_2=x_2)\sim t_1(\mu(x_2),\sigma^2(x_2),\nu+p)$, where expressions for $\mu(x_2)$ and $\sigma^2(x_2)$ can be obtained from above as follows. Partition $\Gamma_{p+1}$ as
$$\Gamma_{p+1} = \begin{bmatrix}\gamma_0 & \gamma_p' \\ \gamma_p & \Gamma_p\end{bmatrix},$$
and denote $\boldsymbol\varphi=\Gamma_p^{-1}\gamma_p$ and $\sigma^2=\gamma_0-\gamma_p'\Gamma_p^{-1}\gamma_p$ ($\sigma^2>0$ as $\Gamma_{p+1}$ is positive definite). From above,
$$\mu(x_2) = \mu_{1|2}(x_2) = \mu + \gamma_p'\Gamma_p^{-1}(x_2-\mu 1_p) = \mu(1-\gamma_p'\Gamma_p^{-1}1_p) + \boldsymbol\varphi' x_2, \qquad \sigma^2(x_2) = \Gamma_{1|2}(x_2) = \frac{\nu-2+(x_2-\mu 1_p)'\Gamma_p^{-1}(x_2-\mu 1_p)}{\nu-2+p}\,\sigma^2.$$

Appendix B: Proofs of Theorems 1–3

Proof of Theorem 1.

Corresponding to $\varphi_0\in\mathbb{R}$, $\boldsymbol\varphi=(\varphi_1,\ldots,\varphi_p)\in\mathbb{S}_p$, $\sigma>0$, and $\nu>2$, define the notation $\Gamma_p$, $\gamma_0$, $\gamma_p$, $\mu$, and $\Gamma_{p+1}$ as in (4), and note that $\Gamma_p$ and $\Gamma_{p+1}$ are, by construction and due to the assumption $\boldsymbol\varphi\in\mathbb{S}_p$, symmetric positive definite Toeplitz matrices. To prove (i), we will construct a $p$–dimensional Markov process $\mathbf{z}_t=(z_t,\ldots,z_{t-p+1})$ ($t=1,2,\ldots$) with the desired properties. We need to specify an appropriate transition probability measure and an initial distribution. For the former, assume that the transition probability measure of $\mathbf{z}_t$ is determined by the density function $t_1(z_t;\,\mu(\mathbf{z}_{t-1}),\,\sigma^2(\mathbf{z}_{t-1}),\,\nu+p)$, where $\mu(\mathbf{z}_{t-1})$ and $\sigma^2(\mathbf{z}_{t-1})$ are obtained from the last two displayed equations in Appendix A by substituting $\mathbf{z}_{t-1}$ for $x_2$. This shows that $\mathbf{z}_t$ can be treated as a Markov chain (see Meyn and Tweedie (2009, Ch. 3)). Concerning the initial value $\mathbf{z}_0$, suppose it follows the t–distribution $\mathbf{z}_0\sim t_p(\mu 1_p,\Gamma_p,\nu)$. Furthermore, if $\mathbf{z}_t^{+}=(z_t,\mathbf{z}_{t-1})=(z_t,\ldots,z_{t-p})$, we find from Appendix A that the density function of $\mathbf{z}_1^{+}$ is given by
$$t_{p+1}(\mathbf{z}_1^{+};\,\mu 1_{p+1},\,\Gamma_{p+1},\,\nu) = t_1(z_1;\,\mu(\mathbf{z}_0),\,\sigma^2(\mathbf{z}_0),\,\nu+p)\, t_p(\mathbf{z}_0;\,\mu 1_p,\,\Gamma_p,\,\nu). \qquad (A1)$$

Thus, $\mathbf{z}_1^{+}\sim t_{p+1}(\mu 1_{p+1},\Gamma_{p+1},\nu)$ and, as in Appendix A, it follows that the marginal distribution of $\mathbf{z}_1$ is the same as that of $\mathbf{z}_0$, that is, $\mathbf{z}_1\sim t_p(\mu 1_p,\Gamma_p,\nu)$ (the specific structure of $\Gamma_{p+1}$ is used here). Hence, as $\mathbf{z}_t$ is a Markov chain, we can conclude that it has a stationary distribution characterized by the density function $t_p(\mathbf{z};\,\mu 1_p,\,\Gamma_p,\,\nu)$ (see Meyn and Tweedie (2009, pp. 230–231)). This completes the proof of (i).

To prove (ii), note that, due to the Markov property, $z_t\mid\mathcal{F}_{t-1}^{z}\sim t_1(\mu(\mathbf{z}_{t-1}),\sigma^2(\mathbf{z}_{t-1}),\nu+p)$, where $\mathcal{F}_{t-1}^{z}$ signifies the sigma-algebra generated by $\{z_s, s<t\}$. Thus we can write the conditional expectation and conditional variance of $z_t$ given $\mathcal{F}_{t-1}^{z}$ as
$$E[z_t\mid\mathcal{F}_{t-1}^{z}] = E[z_t\mid\mathbf{z}_{t-1}] = \mu + \gamma_p'\Gamma_p^{-1}(\mathbf{z}_{t-1}-\mu 1_p) = \varphi_0 + \boldsymbol\varphi'\mathbf{z}_{t-1},$$
$$\mathrm{Var}[z_t\mid\mathcal{F}_{t-1}^{z}] = \mathrm{Var}[z_t\mid\mathbf{z}_{t-1}] = \frac{\nu-2+(\mathbf{z}_{t-1}-\mu 1_p)'\Gamma_p^{-1}(\mathbf{z}_{t-1}-\mu 1_p)}{\nu-2+p}\,\sigma^2.$$

Denote this conditional variance by $\sigma_t^2=\sigma^2(\mathbf{z}_{t-1})$ (and note that $\sigma_t^2>0$ a.s. due to the assumed conditions $\sigma^2>0$, $\Gamma_p>0$, and $\nu>2$). Now the random variables $\varepsilon_t$ defined by $\varepsilon_t \overset{def}{=} (z_t-\varphi_0-\boldsymbol\varphi'\mathbf{z}_{t-1})/\sigma_t$ follow, conditional on $\mathcal{F}_{t-1}^{z}$, the $t_1(0,1,\nu+p)$ distribution. Hence, we obtain the ‘AR(p)–ARCH(p)’ representation (7). Because the conditional distribution $\varepsilon_t\mid\mathcal{F}_{t-1}^{z}\sim t_1(0,1,\nu+p)$ does not depend on $\mathcal{F}_{t-1}^{z}$ (or, more specifically, on the random variables $\{z_s, s<t\}$), the same holds true also unconditionally, $\varepsilon_t\sim t_1(0,1,\nu+p)$, implying that the random variables $\varepsilon_t$ are independent of $\mathcal{F}_{t-1}^{z}$ (or of $\{z_s, s<t\}$). Moreover, from the definition of the $\varepsilon_t$’s it follows that $\{\varepsilon_s, s<t\}$ is a function of $\{z_s, s<t\}$, and hence $\varepsilon_t$ is also independent of $\{\varepsilon_s, s<t\}$. Consequently, the random variables $\varepsilon_t$ are IID $t_1(0,1,\nu+p)$, completing the proof of (ii). □

Proof of Theorem 2.

First note that $\mathbf{y}_t$ is a Markov chain on $\mathbb{R}^p$. Now, let $\mathbf{y}_0=(y_0,\ldots,y_{-p+1})$ be a random vector whose distribution has the density $f(\mathbf{y}_0;\boldsymbol\theta)=\sum_{m=1}^{M}\alpha_m t_p(\mathbf{y}_0;\,\mu_m 1_p,\,\Gamma_{m,p},\,\nu_m)$. According to (8), (9), (11), and (A1), the conditional density of $y_1$ given $\mathbf{y}_0$ is
$$f(y_1\mid\mathbf{y}_0;\boldsymbol\theta) = \sum_{m=1}^{M}\frac{\alpha_m t_p(\mathbf{y}_0;\,\mu_m 1_p,\,\Gamma_{m,p},\,\nu_m)}{\sum_{n=1}^{M}\alpha_n t_p(\mathbf{y}_0;\,\mu_n 1_p,\,\Gamma_{n,p},\,\nu_n)}\, t_1\big(y_1;\,\mu_m(\mathbf{y}_0),\,\sigma_m^2(\mathbf{y}_0),\,\nu_m+p\big) = \sum_{m=1}^{M}\frac{\alpha_m}{\sum_{n=1}^{M}\alpha_n t_p(\mathbf{y}_0;\,\mu_n 1_p,\,\Gamma_{n,p},\,\nu_n)}\, t_{p+1}\big((y_1,\mathbf{y}_0);\,\mu_m 1_{p+1},\,\Gamma_{m,p+1},\,\nu_m\big).$$

It follows that the density of $(y_1,\mathbf{y}_0)$ is $f((y_1,\mathbf{y}_0);\boldsymbol\theta)=\sum_{m=1}^{M}\alpha_m t_{p+1}((y_1,\mathbf{y}_0);\,\mu_m 1_{p+1},\,\Gamma_{m,p+1},\,\nu_m)$. Integrating $y_{-p+1}$ out (and using the properties of marginal distributions of a multivariate t–distribution in Appendix A) shows that the density of $\mathbf{y}_1$ is $f(\mathbf{y}_1;\boldsymbol\theta)=\sum_{m=1}^{M}\alpha_m t_p(\mathbf{y}_1;\,\mu_m 1_p,\,\Gamma_{m,p},\,\nu_m)$. Therefore, $\mathbf{y}_0$ and $\mathbf{y}_1$ are identically distributed. As $\{\mathbf{y}_t\}_{t=1}^{\infty}$ is a (time homogeneous) Markov chain, it follows that $\{\mathbf{y}_t\}_{t=1}^{\infty}$ has a stationary distribution $\pi_{\mathbf{y}}(\cdot)$, say, characterized by the density $f(\cdot;\boldsymbol\theta)=\sum_{m=1}^{M}\alpha_m t_p(\cdot;\,\mu_m 1_p,\,\Gamma_{m,p},\,\nu_m)$ (cf. Meyn and Tweedie (2009, pp. 230–231)).

For ergodicity, let $P_{\mathbf{y}}^{p}(\mathbf{y},\cdot) = \Pr(\mathbf{y}_p\in\cdot\mid\mathbf{y}_0=\mathbf{y})$ signify the $p$–step transition probability measure of $\mathbf{y}_t$. It is straightforward to check that $P_{\mathbf{y}}^{p}(\mathbf{y},\cdot)$ has a density given by
$$f(\mathbf{y}_p\mid\mathbf{y}_0;\boldsymbol\theta) = \prod_{t=1}^{p} f(y_t\mid\mathbf{y}_{t-1};\boldsymbol\theta) = \prod_{t=1}^{p}\sum_{m=1}^{M}\alpha_{m,t}\, t_1\big(y_t;\,\mu_m(\mathbf{y}_{t-1}),\,\sigma_m^2(\mathbf{y}_{t-1}),\,\nu_m+p\big).$$

The last expression makes clear that $f(\mathbf{y}_p\mid\mathbf{y}_0;\boldsymbol\theta)>0$ for all $\mathbf{y}_p\in\mathbb{R}^p$ and all $\mathbf{y}_0\in\mathbb{R}^p$. Now, one can complete the proof that $\mathbf{y}_t$ is ergodic in the sense of Meyn and Tweedie (2009, Ch. 13) by using arguments identical to those used in the proof of Theorem 1 in Kalliovirta, Meitz, and Saikkonen (2015). □

Proof of Theorem 3.

First note that Assumption 1 together with the continuity of $L_T^{(c)}(\boldsymbol\theta)$ ensures the existence of a measurable maximizer $\hat{\boldsymbol\theta}_T$. For strong consistency, it suffices to show that a certain uniform convergence condition and a certain identification condition hold. Specifically, the former required condition is that the conditional log-likelihood function obeys a uniform strong law of large numbers, that is, $\sup_{\boldsymbol\theta\in\Theta}\,|L_T^{(c)}(\boldsymbol\theta)-E[L_T^{(c)}(\boldsymbol\theta)]|\to 0$ a.s. as $T\to\infty$. As the $y_t$’s are stationary and ergodic and $E[L_T^{(c)}(\boldsymbol\theta)]=E[l_t(\boldsymbol\theta)]$, the condition $E[\sup_{\boldsymbol\theta\in\Theta}|l_t(\boldsymbol\theta)|]<\infty$ ensures that the uniform law of large numbers in Ranga Rao (1962) applies.

The validity of the condition $E[\sup_{\boldsymbol\theta\in\Theta}|l_t(\boldsymbol\theta)|]<\infty$ can be established by deriving suitable lower and upper bounds for $l_t(\boldsymbol\theta)$. Recall from (10) and (15) that $l_t(\boldsymbol\theta)=\log\big(\sum_{m=1}^{M}\alpha_{m,t}\, t_1(y_t;\,\mu_{m,t},\,\sigma_{m,t}^2,\,\nu_m+p)\big)$, where
$$t_1(y_t;\,\mu_{m,t},\,\sigma_{m,t}^2,\,\nu_m+p) = C(\nu_m)\,\sigma_{m,t}^{-1}\Big(1+(\nu_m+p-2)^{-1}\Big(\frac{y_t-\mu_{m,t}}{\sigma_{m,t}}\Big)^2\Big)^{-\frac{1+\nu_m+p}{2}}$$
and $C(\nu)=\dfrac{\Gamma\big((1+\nu+p)/2\big)}{\sqrt{\pi(\nu+p-2)}\,\Gamma\big((\nu+p)/2\big)}$. The following arguments hold for some choice of finite positive constants $c_1,\ldots,c_{10}$, and all statements are understood to hold ‘for all $m=1,\ldots,M$’ whenever appropriate. The assumed compactness of the parameter space (Assumption 1) and the continuity of the gamma function on the positive real axis imply that
$$c_1 \le C(\nu_m) \le c_2. \qquad (A2)$$
Next, recall that
$$\sigma_{m,t}^2 = \frac{\nu_m-2+(\mathbf{y}_{t-1}-\mu_m 1_p)'\Gamma_{m,p}^{-1}(\mathbf{y}_{t-1}-\mu_m 1_p)}{\nu_m-2+p}\,\sigma_m^2,$$
where the matrix $\Gamma_{m,p}$ is positive definite and $\sigma_m^2>0$. Thus, by the compactness of the parameter space, $\sigma_{m,t}^2\ge c_3$. On the other hand, as $\Gamma_{m,p}$ is a continuous function of the autoregressive coefficients, the continuity of eigenvalues implies that the smallest eigenvalue of $\Gamma_{m,p}$, $\lambda_{\min}(\Gamma_{m,p})$, is bounded away from zero by a constant. This, together with elementary inequalities, yields
$$(\mathbf{y}_{t-1}-\mu_m 1_p)'\Gamma_{m,p}^{-1}(\mathbf{y}_{t-1}-\mu_m 1_p) \le \lambda_{\min}^{-1}(\Gamma_{m,p})\,\|\mathbf{y}_{t-1}-\mu_m 1_p\|^2 \le c_4\,(1+y_{t-1}^2+\cdots+y_{t-p}^2).$$
Thus, by the compactness of the parameter space, we have $c_3 \le \sigma_{m,t}^2 \le c_5(1+y_{t-1}^2+\cdots+y_{t-p}^2)$, so that also
$$c_5^{-1}\,(1+y_{t-1}^2+\cdots+y_{t-p}^2)^{-1} \le \sigma_{m,t}^{-2} \le c_3^{-1}. \qquad (A3)$$

Therefore
$$\frac{1}{1+(\nu_m+p-2)^{-1}\Big(\frac{y_t-\mu_{m,t}}{\sigma_{m,t}}\Big)^2} \ge \frac{c_6}{1+y_t^2+y_{t-1}^2+\cdots+y_{t-p}^2},$$
which, together with the compactness of the parameter space, implies that
$$c_7\,(1+y_t^2+y_{t-1}^2+\cdots+y_{t-p}^2)^{-c_8} \le \Big(1+(\nu_m+p-2)^{-1}\Big(\frac{y_t-\mu_{m,t}}{\sigma_{m,t}}\Big)^2\Big)^{-\frac{1+\nu_m+p}{2}} \le 1. \qquad (A4)$$
Using (A2)–(A4) it now follows that
$$c_9\,(1+y_{t-1}^2+\cdots+y_{t-p}^2)^{-1/2}\,(1+y_t^2+y_{t-1}^2+\cdots+y_{t-p}^2)^{-c_8} \le t_1(y_t;\,\mu_{m,t},\,\sigma_{m,t}^2,\,\nu_m+p) \le c_{10}.$$
Using this and the fact that $\sum_{m=1}^{M}\alpha_{m,t}(\boldsymbol\theta)=1$ we can now bound $l_t(\boldsymbol\theta)$ from above by a constant, say $l_t(\boldsymbol\theta)\le\bar{C}<\infty$. Furthermore, for some $\bar{C}'<\infty$,
$$-\bar{C}'\,\big(1+\log(1+y_t^2+y_{t-1}^2+\cdots+y_{t-p}^2)\big) \le l_t(\boldsymbol\theta).$$

Hence, as the StMAR process has finite second moments, we can conclude that $E[\sup_{\boldsymbol\theta\in\Theta}|l_t(\boldsymbol\theta)|]<\infty$.

As for the latter condition required for consistency, we need to establish that $E[l_t(\boldsymbol\theta)]\le E[l_t(\boldsymbol\theta_0)]$ and that $E[l_t(\boldsymbol\theta)]=E[l_t(\boldsymbol\theta_0)]$ implies $\boldsymbol\theta=\boldsymbol\theta_0$. For notational clarity, let us make the dependence on parameter values explicit in the expressions in (5) and write $\mu(\cdot,\boldsymbol\vartheta)$ and $\sigma^2(\cdot,\boldsymbol\vartheta)$, and let $\alpha_m(\mathbf{y},\boldsymbol\theta)$ stand for $\alpha_{m,t}$ (see (11)) but with $\mathbf{y}_{t-1}$ therein replaced by $\mathbf{y}$ and with the dependence on the parameter values made explicit ($m=1,\ldots,M$). Making use of the fact that the density of $(y_t,\mathbf{y}_{t-1})$ has the form $f((y_t,\mathbf{y}_{t-1});\boldsymbol\theta)=\sum_{m=1}^{M}\alpha_m t_{p+1}((y_t,\mathbf{y}_{t-1});\,\mu_m 1_{p+1},\,\Gamma_{m,p+1},\,\nu_m)$ (see the proof of Theorem 2) and reasoning based on the Kullback-Leibler divergence, we can now use arguments analogous to those in Kalliovirta, Meitz, and Saikkonen (2015, p. 265) to conclude that $E[l_t(\boldsymbol\theta)]\le E[l_t(\boldsymbol\theta_0)]$ with equality if and only if, for almost all $(y,\mathbf{y})$,
$$\sum_{m=1}^{M}\alpha_m(\mathbf{y},\boldsymbol\theta)\, t_1\big(y;\,\mu(\mathbf{y},\boldsymbol\vartheta_m),\,\sigma^2(\mathbf{y},\boldsymbol\vartheta_m),\,\nu_m+p\big) = \sum_{m=1}^{M}\alpha_m(\mathbf{y},\boldsymbol\theta_0)\, t_1\big(y;\,\mu(\mathbf{y},\boldsymbol\vartheta_{m,0}),\,\sigma^2(\mathbf{y},\boldsymbol\vartheta_{m,0}),\,\nu_{m,0}+p\big). \qquad (A5)$$
For each fixed $\mathbf{y}$ at a time, the mixing weights, conditional means, and conditional variances in (A5) are constants, and we may apply the results on identification of finite mixtures of Student’s t–distributions in Holzmann, Munk, and Gneiting (2006, Example 1) (their parameterization of the t–distribution is slightly different than ours, but identification with their parameterization implies identification in our parameterization). Consequently, for each fixed $\mathbf{y}$ at a time, there exists a permutation $\{\tau(1),\ldots,\tau(M)\}$ of $\{1,\ldots,M\}$ (where this permutation may depend on $\mathbf{y}$) such that
$$\alpha_m(\mathbf{y},\boldsymbol\theta)=\alpha_{\tau(m)}(\mathbf{y},\boldsymbol\theta_0), \quad \mu(\mathbf{y},\boldsymbol\vartheta_m)=\mu(\mathbf{y},\boldsymbol\vartheta_{\tau(m),0}), \quad \sigma^2(\mathbf{y},\boldsymbol\vartheta_m)=\sigma^2(\mathbf{y},\boldsymbol\vartheta_{\tau(m),0}), \quad\text{and}\quad \nu_m=\nu_{\tau(m),0} \quad\text{for almost all } \mathbf{y} \ (m=1,\ldots,M). \qquad (A6)$$
The number of possible permutations being finite ($M!$), this induces a finite partition of $\mathbb{R}^p$ where the elements $\mathbf{y}$ of each partition correspond to the same permutation. At least one of these partitions, say $A\subseteq\mathbb{R}^p$, must have positive Lebesgue measure. Thus, (A6) holds for all fixed $\mathbf{y}\in A$ with some specific permutation $\{\tau(1),\ldots,\tau(M)\}$ of $\{1,\ldots,M\}$. The fact that $\mu(\mathbf{y},\boldsymbol\vartheta_m)=\mu(\mathbf{y},\boldsymbol\vartheta_{\tau(m),0})$ for $m=1,\ldots,M$ and almost all $\mathbf{y}\in A$ can be used to deduce that $(\varphi_{m,0},\boldsymbol\varphi_m)=(\varphi_{\tau(m),0,0},\boldsymbol\varphi_{\tau(m),0})$ for $m=1,\ldots,M$ (see (4), (5), and Kalliovirta, Meitz, and Saikkonen (2015, pp. 265–266)). Similarly, using the condition $\sigma^2(\mathbf{y},\boldsymbol\vartheta_m)=\sigma^2(\mathbf{y},\boldsymbol\vartheta_{\tau(m),0})$ (and the knowledge that $(\varphi_{m,0},\boldsymbol\varphi_m,\nu_m)=(\varphi_{\tau(m),0,0},\boldsymbol\varphi_{\tau(m),0},\nu_{\tau(m),0})$), it follows that $\sigma_m^2=\sigma_{\tau(m),0}^2$, so that $\boldsymbol\vartheta_m=\boldsymbol\vartheta_{\tau(m),0}$ ($m=1,\ldots,M$). Now $\alpha_m=\alpha_{\tau(m),0}$ ($m=1,\ldots,M$) follows as in Kalliovirta, Meitz, and Saikkonen (2015, p. 266). In light of (16), the preceding facts imply that $\boldsymbol\theta=\boldsymbol\theta_0$. This completes the proof of consistency.

Given conditions (i)–(iii) of the theorem, asymptotic normality of the ML estimator can now be established using standard arguments. The required steps can be found, for instance, in Kalliovirta, Meitz, and Saikkonen (2016, proof of Theorem 3). We omit the details for brevity. □