Research Article

Mean-Structure and Autocorrelation Consistent Covariance Matrix Estimation


Abstract

We consider estimation of the asymptotic covariance matrix in nonstationary time series. A nonparametric estimator that is robust against unknown forms of trends and a possibly divergent number of change points (CPs) is proposed. It is algorithmically fast because neither a search for CPs, estimation of trends, nor cross-validation is required. Together with our proposed automatic optimal bandwidth selector, the resulting estimator is both statistically and computationally efficient. It is, therefore, useful in many statistical procedures, for example, CP detection and construction of simultaneous confidence bands of trends. Empirical studies on four stock market indices are also discussed.

1 Introduction

In many real applications, the observed time series $\{Y_i\}_{i=1}^n$ is a contaminated version of an ideal stationary time series $\{X_i\}_{i=1}^n$. The contamination may consist of an unknown trend, seasonality, and abrupt change points (CPs). This type of nonstationary time series is commonly encountered in Econometrics, Risk Management, Neurology, Genetics, Ecology, etc. (see, e.g., Horváth, Kokoszka, and Steinebach Citation1999; Granger and Hyung Citation2004; Banerjee and Urga Citation2005; Mikkonen et al. Citation2014; Kirch, Muhsal, and Ombao Citation2015). As a result, assessing the stationarity of $Y_i$ is usually indispensable before conducting inference and modeling. Many tests for this purpose require estimating the asymptotic covariance matrix (ACM) of $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$, namely, $\Sigma = \lim_{n\to\infty} n\,\mathrm{var}(\bar{X}_n)$. Their performance relies on an efficient estimator of $\Sigma$ that is robust against the mean and autocorrelation structures. This article addresses the problem of mean-structure and autocorrelation consistent (MAC) estimation of $\Sigma$.

One classical problem in assessing stability is CP detection. A large class of CP tests is based on the cumulative sum (CUSUM) process (e.g., Brown, Durbin, and Evans Citation1975; Ploberger and Krämer Citation1992; Jirak Citation2015). Among them, the celebrated Kolmogorov–Smirnov (KS) test is arguably the most commonly used in detecting a mean shift. In the univariate case, the KS test statistic usually requires an estimator of the asymptotic variance constant (AVC) $\sigma^2$, that is, the univariate analog of $\Sigma$, for standardization (see Csörgő and Horváth Citation1997). However, without a jump robust estimator of $\sigma^2$, the KS test may not be monotonically powerful with respect to the jump magnitude (see Vogelsang Citation1999; Crainiceanu and Vogelsang Citation2007; Juhl and Xiao Citation2009). Indeed, the power may even vanish completely (see for a visualization of this phenomenon). Moreover, many existing approaches are either restricted to one CP or do not fully eliminate the nonmonotone problem. Furthermore, for multidimensional CP tests (e.g., Horváth, Kokoszka, and Steinebach Citation1999), there is no robust estimator of $\Sigma$. Although Shao and Zhang (Citation2010) proposed a self-normalized KS test, which does not require estimating $\Sigma$, it sacrifices power in return. Its power may even vanish completely under a misspecified alternative (see Section 5.4).

Besides detection of CPs, there are many more dedicated procedures for assessing the stability of the mean, for example, testing the existence of structural breaks in trends, constructing simultaneous confidence bands (SCB) of trends, and testing non-constancy of trends (see Wu, Woodroofe, and Mentz Citation2001; Wu Citation2004; Wu and Zhao Citation2007, and references therein). All of the aforementioned procedures require an estimator of $\sigma^2$ that is jump robust as well as trend robust. It is worth mentioning that even in the absence of CPs and trend, estimation of $\sigma^2$ is already difficult because it requires specifying a bandwidth parameter (see, e.g., Andrews Citation1991; Newey and West Citation1994). To the best of our knowledge, there is no jump and trend robust estimator of $\Sigma$ that is equipped with an optimal bandwidth estimator.

In view of the above problems, this article proposes a single-pass (i.e., neither CPs nor trends need to be estimated) and fully nonparametric estimator of $\Sigma$ for general multidimensional time series. It is consistent and robust even if there is a divergent number of CPs and a nonconstant trend of unknown form. Furthermore, a closed-form formula for the optimal bandwidth is derived so that users do not need to resort to computationally intensive cross-validation. Hence, the resulting estimator is MAC, statistically efficient, and computationally fast.

The remaining part of the article is organized as follows. Section 2 reviews standard estimation methods for $\Sigma$. Section 3 motivates the proposed jump robust estimator and presents the key theoretical results. Section 4 extends the estimator to trend robustness; implementation issues and generalizations are also discussed. Section 5 illustrates the finite-sample performance. Section 6 presents empirical studies on stock market indices. Section 7 concludes the article.

2 Review of Asymptotic Covariance Estimation

2.1 Mathematical Setup

Suppose the observed time series $\{Y_i \in \mathbb{R}^d\}_{i=1}^n$, $d \in \mathbb{N}$, is generated from $Y_i = \mu(i/n) + X_i$, where $\mu : [0,1] \to \mathbb{R}^d$ is a mean function; and $\{X_i \in \mathbb{R}^d\}_{i \in \mathbb{Z}}$ is strictly stationary and ergodic with mean $\mathbb{E}X_i = 0$, $i \in \mathbb{Z}$, and autocovariance function (ACVF) $\Gamma_k := \mathbb{E}(X_k X_0^{\top})$, $k \in \mathbb{Z}$. Also denote the symmetrized ACVF by $\Pi_k := (\Gamma_k + \Gamma_{-k})/2$, $k \in \mathbb{Z}$. The ACM of $\bar{X}_n := n^{-1}\sum_{i=1}^n X_i$ is defined by
$$\Sigma := \lim_{n\to\infty} n\,\mathrm{var}(\bar{X}_n) = \sum_{k=-\infty}^{\infty} \Gamma_k = \sum_{k=-\infty}^{\infty} \Pi_k, \qquad (1)$$
provided that the limit exists. Note that the ACM is also known as the time-average covariance matrix, the long-run covariance matrix, and the (scaled) spectral density at zero frequency.

The unknown mean $\mu(\cdot)$ combines trend, seasonality, and jump discontinuities:
$$\mu(i/n) := f(i/n) + \sum_{j=0}^{J} \xi_j\, \mathbb{1}\{D_j \le i < D_{j+1}\}, \qquad (2)$$
where $f := f_n : [0,1] \to \mathbb{R}^d$ is a sequence of continuous functions; $J := J_n$ is the number of CPs; $\xi_j := \xi_{j,n}$ is the mean shift from $f$ in the time period $[D_j, D_{j+1})$, for $j = 0, \ldots, J$; and $D_j := D_{j,n}$ is the $j$th CP, for $j = 1, \ldots, J$, such that $1 \equiv D_0 < D_1 < \cdots < D_J < D_{J+1} \equiv n+1$. Here the indicator $\mathbb{1}\{E\} = 1$ if the event $E$ occurs, otherwise $\mathbb{1}\{E\} = 0$. Without loss of generality, assume $\xi_{j-1} \ne \xi_j$ for all $j = 1, \ldots, J$. For simplicity, we write $\mu_i := \mu(i/n)$.

The unobservable time series $\{X_i\}$ is assumed to admit a causal representation $X_i = g(\mathcal{F}_i)$, where $g(\cdot)$ is a $d$-dimensional measurable function, $\mathcal{F}_i := (\ldots, \varepsilon_{i-1}, \varepsilon_i)$; and $\{\varepsilon_i\}_{i\in\mathbb{Z}}$ are independent and identically distributed multidimensional vectors of innovations (see Wu Citation2005). This framework is general enough to cover many commonly used models, for example, the autoregressive moving average (ARMA) model, the Volterra series, the bilinear (BL) model, the threshold AR model, and the generalized AR conditional heteroscedastic (GARCH) model (see, e.g., Wu Citation2011; Degras et al. Citation2012). More multivariate examples defined under this framework can be found in Sections 1 and 2 of Wu and Zaffaroni (Citation2018).

2.2 Mathematical Notations

The following notations are used in the article. Denote $\mathbb{N} = \{1, 2, \ldots\}$ and $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$. For $a \in \mathbb{R}$, $\lfloor a \rfloor$ and $\lceil a \rceil$ are the floor and ceiling of $a$, respectively. For $a, b \in \mathbb{R}$, denote $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. When the sample size $n$ is clear, denote $[[a]] = (2 \vee a) \wedge (n-1)$. For real sequences $\{a_n\}$ and $\{b_n\}$, write $a_n \sim b_n$ if $a_n/b_n \to 1$; $a_n = o(b_n)$ if $a_n/b_n \to 0$; and $a_n = O(b_n)$ if there are $M > 0$ and $N$ such that $|a_n/b_n| \le M$ for all $n \ge N$.

Matrices and vectors are written in boldface, while scalars are written in normal face. The $(r, s)$th element of a matrix $A$ is denoted by $A^{[r,s]}$. The $u$th component of a vector $\mu$ is denoted by $\mu^{[u]}$. In the one-dimensional case (i.e., $d = 1$), the ACM in (1) is written as $\Sigma$, $\Sigma^{[1,1]}$, or $\sigma^2$, and the mean function in (2) is written as $\mu(\cdot)$ or $\mu^{[1]}(\cdot)$.

For any matrix $A$, denote its entry-wise absolute value by $|A|$, its trace by $\mathrm{tr}(A)$, its transpose by $A^{\top}$, its column-by-column vectorization by $\mathrm{vec}(A)$, and $A^{\otimes 2} = A A^{\top}$. The diagonalization of a vector $v$ is denoted by $\mathrm{diag}(v)$, that is, a diagonal matrix whose diagonal elements are the elements of $v$. Denote the column vector of ones, the column vector of zeros, and the identity matrix by $\mathbf{1}$, $\mathbf{0}$, and $I$, respectively.

For any real random variable $Z$ and any $p \ge 1$, denote $\|Z\|_p = (\mathbb{E}|Z|^p)^{1/p}$. For any vector-valued random variable $Z$, we write $Z \in \mathcal{L}^p$ if $\|Z^{[u]}\|_p < \infty$ for all $u$. If $\varepsilon_1, \ldots, \varepsilon_n$ are independently and identically distributed (iid) as the standard normal distribution, we write $\varepsilon_1, \ldots, \varepsilon_n \overset{\text{iid}}{\sim} N(0, 1)$. If $\varepsilon$ and $\varepsilon'$ are iid, then we say that $\varepsilon'$ is an iid copy of $\varepsilon$.

2.3 Estimation in Stationary Time Series

Suppose that $\mu_1 = \cdots = \mu_n$. There are three standard classes of methods to estimate $\Sigma$. The first one is the subsampling method (see, e.g., Meketon and Schmeiser Citation1984; Carlstein Citation1986; Song and Schmeiser Citation1995; Politis, Romano, and Wolf Citation1999; Chan and Yau Citation2017b). For instance, the overlapping batch means (OBM) estimator is
$$\hat{\Sigma}_{\mathrm{OBM},n} := \frac{l}{n-l+1} \sum_{i=l}^{n} \Big( \frac{1}{l} \sum_{j=i-l+1}^{i} \hat{X}_j \Big)^{\otimes 2}, \qquad (3)$$
where $l \in \mathbb{N} \cap (1, n)$ is the batch size, $\hat{X}_i := Y_i - \bar{Y}_n$, and $\bar{Y}_n := n^{-1}\sum_{i=1}^n Y_i$. The second one is the kernel method (see, e.g., Newey and West Citation1987; Andrews Citation1991; Politis Citation2011). For example, the Bartlett kernel and the quadratic spectral (QS) kernel estimators are, respectively,
$$\hat{\Sigma}_{\mathrm{Bart},n} := \sum_{k=-l}^{l} \mathrm{Bart}(k/l)\, \hat{\Gamma}_k \quad\text{and}\quad \hat{\Sigma}_{\mathrm{QS},n} := \sum_{k=-(n-1)}^{n-1} \mathrm{QS}(k/l)\, \hat{\Gamma}_k, \qquad (4)$$
where $\hat{\Gamma}_k := n^{-1}\sum_{i=|k|+1}^{n} \hat{X}_i \hat{X}_{i-|k|}^{\top}$, $\mathrm{Bart}(t) := (1-|t|)\,\mathbb{1}(|t| \le 1)$, and $\mathrm{QS}(t) := \frac{25}{12\pi^2 t^2}\{\sin(6\pi t/5)/(6\pi t/5) - \cos(6\pi t/5)\}$. The third one is based on the resampling method (see, e.g., Künsch Citation1989; Politis and Romano Citation1994; Paparoditis and Politis Citation2001; Lahiri Citation2003). Recently, a new class of estimators based on orthonormal sequences was proposed (see, e.g., Phillips Citation2005; Sun Citation2013). The choice of kernel or orthonormal sequences is discussed in Lazarus et al. (Citation2018). Besides, Müller (Citation2014) studied the problem under strong autocorrelation.

Estimation of $\Sigma$ is important because it is usually required in the inference of $\mu$, for example, in the construction of SCBs for $\mu$ and in output analysis for Markov chain Monte Carlo (see Flegal and Jones Citation2010; Chan and Yau Citation2016, Citation2017a; Liu and Flegal Citation2018). All three methods above require specifying an unknown bandwidth $l$ (or the batch size, block size, etc.). In practice, $l$ is crucial to the performance of the estimators, but its optimal value is notoriously difficult to estimate (see, e.g., Politis Citation2003; Hirukawa Citation2010).
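For concreteness, the OBM and Bartlett kernel estimators in (3) and (4) can be sketched for a univariate series as follows (a minimal illustration, not the authors' code; the function names and the demeaning convention are our own):

```python
import numpy as np

def obm_est(y, l):
    """Overlapping batch means estimator of the AVC sigma^2, cf. (3)."""
    n = len(y)
    x = y - y.mean()                                   # demeaned series X-hat
    bm = np.convolve(x, np.ones(l) / l, mode="valid")  # all overlapping batch means of size l
    return l / (n - l + 1) * np.sum(bm ** 2)

def bartlett_est(y, l):
    """Bartlett kernel estimator of sigma^2, cf. (4)."""
    n = len(y)
    x = y - y.mean()
    s = x.var()                                        # lag-0 term, Gamma-hat_0
    for k in range(1, l + 1):
        gk = np.mean(x[k:] * x[:-k]) * (n - k) / n     # Gamma-hat_k with 1/n normalization
        s += 2 * (1 - k / l) * gk                      # Bartlett weight (1 - k/l), both lags +-k
    return s
```

Both functions return the same order of magnitude on stationary data; their sensitivity to a shifted mean is exactly the nonrobustness issue discussed in Section 2.4.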

2.4 Estimation in Nonstationary Time Series

Suppose that $\mu_i \ne \mu_j$ for some $i \ne j$. In this case, as far as we know, (i) all existing estimators of $\Sigma$ are restricted to particular forms of mean structure; and (ii) the estimators are not equipped with an optimal bandwidth selector. Some representative estimators are listed below and are summarized in Table 1. For reference, the precise formulas of the estimators are presented in Section C.1 of the supplementary materials.

Table 1 A summary of the robust estimators introduced in Section 2.4, where AC, CV, J, W, and WZ represent the estimators proposed in Altissimo and Corradi (Citation2003), Crainiceanu and Vogelsang (Citation2007), Jirak (Citation2015), Wu (Citation2004), and Wu and Zhao (Citation2007), respectively.

Altissimo and Corradi (Citation2003) proposed estimating $\sigma^2$ by applying a standard kernel estimator to the time series after de-trending by a local mean estimator. The resulting estimator is consistent when the mean is a piecewise constant function with finitely many breaks. However, there are some drawbacks. First, they did not derive the optimal bandwidth. It is possible that the optimal bandwidth of the modified estimator is different from that of the standard kernel estimator. Second, the modified estimator introduces an extra tuning parameter, that is, the bandwidth of the local mean estimator. This bandwidth has to be chosen carefully to obtain a consistent estimator of $\sigma^2$, but its optimal value remains unsolved. A similar method was proposed by Juhl and Xiao (Citation2009) in a hypothesis testing context. However, their estimator is inconsistent under nonstationarity.

Crainiceanu and Vogelsang (Citation2007) found that a CP test has nonmonotonic power if a non-robust estimator of $\sigma^2$ is used. They proposed an estimator of $\sigma^2$ that is robust to one CP. Their idea is to estimate a potential CP and then de-mean the observed time series before and after the estimated CP separately, so that the standard methods in Section 2.3 can be applied to estimate $\sigma^2$. Their remedy mitigates the nonmonotone problem, but it still has some drawbacks. First, it allows a single CP only, and the trend must be piecewise constant. In reality, these assumptions may not be satisfied (see Section 6.1). Second, the optimal bandwidth is estimated by the parametric plug-in method proposed by Andrews (Citation1991). If the parametric model is misspecified, its performance is doubtful. Recently, Jirak (Citation2015) proposed a similar de-trending method for estimating $\sigma^2$ robustly, but the optimal bandwidth selection issue was not addressed.

Wu (Citation2004) and Wu and Zhao (Citation2007) proposed using the first-order difference of nonoverlapping batch means (NBMs) to construct robust estimators of $\sigma^2$. There are some drawbacks. First, NBM-type estimators are less efficient than their overlapping batch means counterparts in terms of mean-squared error (MSE) (see Politis, Romano, and Wolf Citation1999). Thus, their estimators suffer a significant loss in $\mathcal{L}^2$ efficiency. It is worth noting that there is no trivial way to extend their NBM-type estimators to the more efficient OBM-type estimators (see Remark C.2 in the supplementary materials). Second, they did not derive the optimal bandwidth.

Remark 2.1.

Gonçalves and White (Citation2002) proved that two block bootstrap estimators (Künsch Citation1989; Politis and Romano Citation1994) are consistent under a mild nonconstant mean structure, namely $U_n := \sum_{i=1}^n (\mu_i - \bar{\mu}_n)^2/n = o(1/l)$, where $l$ is the block size used in the estimators, and $\bar{\mu}_n = \sum_{i=1}^n \mu_i/n$. However, $U_n = o(1/l)$ does not hold if there is one nontrivial jump in the mean. For example, if $\mu_i = \mathbb{1}(i \ge n/2)$, that is, the mean jumps from 0 to 1 at $n/2$, then $U_n \to 1/4 \ne 0$. Gallant and White (Citation1988) documented similar results in the context of heteroscedasticity and autocorrelation consistent (HAC) variance estimation. In their Theorem 6.8, they showed that standard HAC estimators are biased unless the mean is constant.
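The limit $U_n \to 1/4$ in Remark 2.1 is easy to verify numerically; the following sketch (our own illustration) evaluates $U_n$ for the single-jump mean $\mu_i = \mathbb{1}(i \ge n/2)$:

```python
import numpy as np

def U_n(mu):
    """Mean-structure fluctuation U_n = sum_i (mu_i - mean(mu))^2 / n."""
    return np.mean((mu - mu.mean()) ** 2)

n = 10_000
mu = (np.arange(1, n + 1) >= n / 2).astype(float)  # mean jumps from 0 to 1 at n/2
print(U_n(mu))  # close to 1/4, so U_n = o(1/l) fails for any diverging block size l
```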

3 Jump Robustness

3.1 Motivation

Throughout Section 3, $\mu(\cdot)$ is assumed to be a piecewise constant function with $J$ jumps, that is, $f(x) \equiv 0$. This assumption will be relaxed in Section 4.1. Our proposal is to use a differencing technique consecutively (see Remark 3.1). If the number of jumps $J$ and the magnitudes of the jumps $|\xi_j - \xi_{j-1}|$ are not too large, then each value in the lag-1 difference sequence $\{Y_i - Y_{i-1}\}_{i=2}^n$ is approximately of mean zero. In this case, we have
$$\mathbb{E}\Big\{ \frac{1}{2(n-1)} \sum_{i=2}^{n} (Y_i - Y_{i-1})^{\otimes 2} \Big\} \approx \frac{1}{2}(2\Gamma_0 - \Gamma_1 - \Gamma_{-1}) = \Pi_0 - \Pi_1.$$

So, the semi-average of the lag-1 difference sequence is a potential estimator of $\Pi_0 - \Pi_1$, the spread between the symmetrized ACVF at lag 0 and lag 1. Similarly, a potential estimator of $\Pi_0 - \Pi_k$ is the semi-average of the lag-$k$ difference sequence
$$\hat{\Psi}_k := \frac{1}{2(n - |k|)} \sum_{i=|k|+1}^{n} (Y_i - Y_{i-|k|})^{\otimes 2}, \qquad |k| = 0, 1, \ldots, n-1. \qquad (5)$$

The convention $\hat{\Psi}_t := \hat{\Psi}_{\lceil |t| \rceil \wedge (n-1)}$ is used for $t \in \mathbb{R}$. The summability of $\Pi_k$ in (1) implies $\Pi_L \to 0$ as $L \to \infty$. Hence, $\hat{\Psi}_L$ approximates $\Pi_0$ for large $L$. The bi-differencing estimator
$$\hat{\Pi}_k^{(L)} := \hat{\Psi}_L - \hat{\Psi}_k, \qquad |k| = 0, 1, \ldots, n-1, \qquad (6)$$
is, thus, a potential estimator of $\Pi_k$ when $L$ is large. Observe that the sample mean $\bar{Y}_n$ is not involved in the definition of $\hat{\Pi}_k^{(L)}$; therefore, we can estimate the ACVFs without estimating the mean $\mu$. The concept of bi-differencing is new. The first and second differencing operations (5) and (6) remove the first- and second-order offsets, that is, $\mu(\cdot)$ and $\Pi_0$, respectively. A graphical illustration of the bi-differencing concept can be found in Section A of the supplementary materials. Using the representation $\Sigma = \sum_{k=-\infty}^{\infty} \Pi_k$ in (1), we may use a “naive” estimator of $\Sigma$ as follows:
$$\hat{\Sigma}_{\mathrm{naive},n} := \sum_{k=-l}^{l} K(k/l)\, \hat{\Pi}_k^{(L)} = \sum_{k=-l}^{l} K(k/l)\,(\hat{\Psi}_L - \hat{\Psi}_k), \qquad (7)$$
where $l \in \mathbb{N}$ is a bandwidth, $K(\cdot)$ is a kernel function, and $L = c_0 l$ for some $c_0 \ge 1$.
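A univariate sketch of $\hat{\Psi}_k$ in (5) and the naive estimator (7) may look as follows (our own minimal illustration; the Bartlett-type kernel $K(t) = 1 - |t|$ and $c_0 = 1$ are assumed for concreteness):

```python
import numpy as np

def psi_hat(y, k):
    """Semi-average of the lag-k difference sequence, cf. (5)."""
    k = abs(k)
    if k == 0:
        return 0.0                       # lag-0 differences are identically zero
    d = y[k:] - y[:-k]                   # lag-k differences; approx. mean-free under small jumps
    return np.sum(d ** 2) / (2 * (len(y) - k))

def sigma_naive(y, l, c0=1):
    """Naive bi-differencing estimator (7) with kernel K(t) = 1 - |t|."""
    L = min(c0 * l, len(y) - 1)          # truncate L at n - 1
    return sum((1 - abs(k) / l) * (psi_hat(y, L) - psi_hat(y, k))
               for k in range(-l, l + 1))
```

Note that the single correction term $\hat{\Psi}_L$ is shared by all lags, which is exactly the source of the inflated variance discussed next.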

However, $\mathrm{MSE}(\hat{\Sigma}_{\mathrm{naive},n}^{[r,s]}) \to 0$ slowly for all $r, s \in \{1, \ldots, d\}$. We demonstrate this through a simple Monte Carlo experiment in Section B of the supplementary materials. An explanation of the slow convergence of $\hat{\Sigma}_{\mathrm{naive},n}$ is that the “same correction term” $\hat{\Psi}_L$ is used for all $\hat{\Psi}_k$, $k = 0, \ldots, l$, in (7). So, the variance of the “aggregated correction term” $\sum_{k=-l}^{l} K(k/l)\hat{\Psi}_L = O(l)\hat{\Psi}_L$ increases quadratically with $l$. This is a huge loss in $\mathcal{L}^2$ efficiency because the variance of a standard ACM estimator increases only linearly with $l$ (see Andrews Citation1991, Proposition 1(a)). Our strategy is to replace $\hat{\Pi}_k^{(L)}$ in (7) by $\hat{\Pi}_k^{(L_k)}$ with an appropriately chosen sequence $\{L_k \in \mathbb{R}^+\}_{k \in \mathbb{Z}}$. This sequence should satisfy the following two conditions.

  1. (Bias condition) $L_k \to \infty$ as $l \to \infty$ for each $k$, so that $\hat{\Psi}_{L_k} \approx \Pi_0$ for each $k$. This ensures that $\hat{\Pi}_k^{(L_k)} = \hat{\Psi}_{L_k} - \hat{\Psi}_k$ is able to approximate $\Pi_k$ accurately with a small bias.

  2. (Variance condition) $L_k$ increases with $|k|$ for each $l$, so that asymptotically different correction terms $\hat{\Psi}_{L_0}, \ldots, \hat{\Psi}_{L_l}$ are used to correct $\hat{\Psi}_0, \ldots, \hat{\Psi}_l$ for each $l$. Since $\hat{\Psi}_{L_0}, \ldots, \hat{\Psi}_{L_l}$ are not perfectly correlated, this helps reduce the variance of the new “aggregated correction term” $\sum_{k=-l}^{l} K(k/l)\hat{\Psi}_{L_k}$.

Hence, $L_k$ should increase with both $l$ and $|k|$. One such choice of $L_k$ is a linear combination of $l$ and $|k|$ with positive weights, that is, $L_k = c_0 l + c_1 |k|$, $c_0, c_1 \in \mathbb{R}^+$. Note that $\hat{\Sigma}_{\mathrm{naive},n}$ sets $c_1 = 0$, which violates the variance condition. In the remaining part of this article, we will demonstrate that $L_k = c_0 l + c_1 |k|$ is sufficient to produce optimal results (see Remark 3.2).

Remark 3.1.

Difference-based estimators are not new. They have been used in time series analysis and robust estimation (see, e.g., Anderson Citation1971; Hall, Kay, and Titterington Citation1990; Dette, Munk, and Wagner Citation1998; Hall and Horowitz Citation2013). However, they are restricted to the estimation of the marginal variance $\Gamma_0$. Differently, we aim at estimating the ACM $\Sigma = \sum_{k \in \mathbb{Z}} \Gamma_k$. This is a harder problem than estimating $\Gamma_0$, and it requires a new technique, which we call bi-differencing. Our bi-differencing technique is partially motivated by the bipower variation (Barndorff-Nielsen and Shephard Citation2004) in the context of testing for jumps in a continuous-time series.

Remark 3.2.

As we will show in Theorems 3.1 and 3.2, a linear form of $L_k$ already achieves the optimal convergence rate. Although an incremental improvement may be possible by using a more general form of $L_k$, we leave it for future investigation.

3.2 Proposed Robust Estimators and Overview of Main Results

For estimation of $\Sigma$, we can use the polynomial kernel $K_q(x) = (1 - |x|^q)\,\mathbb{1}\{|x| \le 1\}$ for some $q \in \mathbb{N}$. The jump robust estimator of the ACM $\Sigma$ is then defined by
$$\hat{\Sigma}_{0,q,n} := \sum_{k=-l}^{l} K_q(|k|/l)\, \hat{\Pi}_k = \sum_{k=-l}^{l} \Big\{ 1 - \Big|\frac{k}{l}\Big|^q \Big\} \big\{ \hat{\Psi}_{c_0 l + c_1 |k|} - \hat{\Psi}_k \big\}, \qquad (8)$$
where $l = l_n \in \mathbb{N} \cap (1, n)$. Using other kernels in (8), for example, $\mathrm{QS}(\cdot)$, is also possible; however, we focus on the polynomial kernel $K_q(\cdot)$ in this article to avoid complication. Extension to other kernels is routine. Users may choose their favorite kernel and their favorite sequence $L_k$. Relative to $l$, these choices have slightly less impact on $\hat{\Sigma}_{p,q,n}$, at least in the first-order asymptotics. The effect of the kernel choice on higher-order asymptotics (see, e.g., Lazarus et al. Citation2018) is theoretically interesting. However, it is beyond the scope of this article, and we leave it for future research.

Suppose that $\Sigma_q := \sum_{k=-\infty}^{\infty} |k|^q \Pi_k$ exists and its entries are finite. Under the conditions in Theorems 3.1 and 3.2 and part (1) of Corollary 4.1 (to be presented in Sections 3.3 and 4.1), the value of $\mathrm{MSE}(\hat{\Sigma}_{0,q,n}^{[r,s]}) = \mathbb{E}(\hat{\Sigma}_{0,q,n}^{[r,s]} - \Sigma^{[r,s]})^2$ is given by
$$\mathrm{MSE}(\hat{\Sigma}_{0,q,n}^{[r,s]}) \sim (\Sigma_q^{[r,s]})^2 \frac{1}{l^{2q}} + \Big[ \frac{4q^2(1+c_1)\{\Sigma^{[r,r]}\Sigma^{[s,s]} + (\Sigma^{[r,s]})^2\}}{(q+1)(2q+1)} \Big] \frac{l}{n} \qquad (9)$$
for each $r, s$. Hence, if $l = O(n^{1/(1+2q)})$, then $\mathrm{MSE}(\hat{\Sigma}_{0,q,n}^{[r,s]}) = O(n^{-2q/(1+2q)})$, which is the optimal convergence rate achieved by the standard estimators (see, e.g., Andrews Citation1991). In other words, the proposed robust estimator $\hat{\Sigma}_{0,q,n}$ is rate-optimal in the $\mathcal{L}^2$ sense.

From (9), the MSE of the proposed estimator $\hat{\Sigma}_{0,q,n}^{[r,s]}$ depends on $\Sigma_q^{[r,s]}$. Hence, its MSE-optimal bandwidth $l$ also depends on $\Sigma_q$. As a result, a robust estimator of $\Sigma_q$ is also important for estimating the optimal bandwidth. This phenomenon is similar to the classical results in non-robust estimation of $\Sigma$ (see, e.g., Andrews Citation1991). It motivates us to study robust estimation of all of $\Sigma_0, \Sigma_1, \Sigma_2, \ldots$. Similar to (8), our proposed jump robust estimator of $\Sigma_p = \sum_{k \in \mathbb{Z}} |k|^p \Pi_k$ ($p \in \mathbb{N}_0$) is defined as
$$\hat{\Sigma}_{p,q,n} := \hat{\Sigma}_{p,q,n}(Y_{1:n}, l, c_0, c_1) := \sum_{k=-l}^{l} K_q(|k|/l)\, |k|^p\, \hat{\Pi}_k. \qquad (10)$$
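In the univariate case, the estimator (10) with $\hat{\Pi}_k = \hat{\Psi}_{c_0 l + c_1|k|} - \hat{\Psi}_k$ can be sketched as follows (a minimal illustration under our own default choices $c_0 = c_1 = 1$; not the authors' implementation):

```python
import numpy as np

def psi_hat(y, k):
    """Semi-average of the lag-k difference sequence, cf. (5)."""
    k = min(abs(int(round(k))), len(y) - 1)   # convention: clip the lag at n - 1
    if k == 0:
        return 0.0
    d = y[k:] - y[:-k]
    return np.sum(d ** 2) / (2 * (len(y) - k))

def sigma_pqn(y, l, p=0, q=2, c0=1.0, c1=1.0):
    """Jump robust estimator (10) with polynomial kernel K_q(x) = 1 - |x|^q."""
    s = 0.0
    for k in range(-l, l + 1):
        kq = 1 - (abs(k) / l) ** q                                # K_q(|k|/l)
        pi_k = psi_hat(y, c0 * l + c1 * abs(k)) - psi_hat(y, k)   # bi-differenced Pi-hat_k
        s += kq * abs(k) ** p * pi_k
    return s
```

The lag-dependent correction term $\hat{\Psi}_{c_0 l + c_1|k|}$ is the only change relative to the naive estimator (7), yet it is what restores the linear-in-$l$ variance growth.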

The statistical meanings of $p$ and $q$ are summarized in the first two rows of Table 2.

Table 2 Summary of the statistical meanings of p, q, P and their associated quantities.

3.3 Theoretical Results

We develop a general estimation procedure that takes various levels of serial dependence into account. For each $P \in \mathbb{N}_0$, define $\Upsilon_P := \sum_{k=-\infty}^{\infty} |k|^P |\Pi_k|$. The finiteness of $\Upsilon_P$ characterizes the strength of serial dependence; thus, it usually serves as an assumption for proving consistency of estimators (see, e.g., Politis Citation2011, Theorem 1). More precisely, we follow Chan and Yau (Citation2017b) to define the coefficient of serial dependence of $\{X_i\}$ by
$$\mathrm{CSD}(X) := \sup\{P \in \mathbb{N}_0 : \Upsilon_P^{[\star,\star]} < \infty\}, \quad\text{where}\quad \Upsilon_P^{[\star,\star]} := \max_{r,s \in \{1,\ldots,d\}} \Upsilon_P^{[r,s]}. \qquad (11)$$

Clearly, the larger the value of $\mathrm{CSD}(X)$, the weaker the serial dependence. For example, consider a univariate fractional Gaussian noise process (Davies and Harte Citation1987), defined as a Gaussian process with ACVF $\Gamma_k = a(|k| + c)^{-b}$ for each $k$, where $a, c > 0$ and $b > 1$. In this model, $\Upsilon_P < \infty$ if and only if $P < b - 1$. Hence, $\mathrm{CSD}(X) = \sup\{P \in \mathbb{N}_0 : P < b - 1\}$. More examples and their associated values of $\Upsilon_P$ can be found in Appendix B of Chan and Yau (Citation2017b). Some multivariate examples can be found in Section C.3 of Chan and Yau (Citation2017a). As we shall see in Section 3.3.1, the assumption on the CSD plays a critical role in controlling the bias of the estimator $\hat{\Sigma}_{p,q,n}$.

Asymptotic theories are built on the framework of dependence measures (see Wu Citation2005). Recall that $X_i = g(\mathcal{F}_i)$ and $\mathcal{F}_i := (\ldots, \varepsilon_{i-1}, \varepsilon_i)$ (see Section 2.1). Let $\varepsilon_j'$ be an iid copy of $\varepsilon_j$. Denote $X_{i,\{j\}} := g(\mathcal{F}_{i,\{j\}})$ and $\mathcal{F}_{i,\{j\}} := (\mathcal{F}_{j-1}, \varepsilon_j', \varepsilon_{j+1}, \ldots, \varepsilon_i)$. Define the physical dependence measure and its aggregated value by, respectively,
$$\delta_{4,i}^{[u]} := \big\| X_i^{[u]} - X_{i,\{0\}}^{[u]} \big\|_4 \quad\text{and}\quad \Delta_4^{[u]} := \sum_{i=0}^{\infty} \delta_{4,i}^{[u]}.$$

For example, consider a univariate linear process (Brockwell and Davis Citation1991, Definition 3.2.1) defined as $X_i = \sum_{j=0}^{\infty} c_j \varepsilon_{i-j}$, where $\{c_j\}$ are real coefficients such that $\sum_{j=0}^{\infty} |c_j| < \infty$, and $\{\varepsilon_j\}$ are iid noises such that $\mathbb{E}|\varepsilon_0|^4 < \infty$. Then $\delta_{4,i}^{[1]} = K|c_i|$ for each $i$, and $\Delta_4^{[1]} = K \sum_{j=0}^{\infty} |c_j| < \infty$, where $K = \|\varepsilon_i - \varepsilon_i'\|_4 < \infty$. More univariate examples and their associated values of the physical dependence measure can be found in Examples 1–11 of Wu (Citation2011). See also Models I–VI in Example 1 of Chan and Yau (Citation2017a) for some multivariate examples. Finiteness of $\Delta_4^{[\star]} := \max_{u \in \{1,\ldots,d\}} \Delta_4^{[u]}$ (i.e., Assumption 3.1) is a mild and easily verifiable condition for studying asymptotic properties (see Wu Citation2007).

Assumption 3.1

(Short-range dependence). The time series $\{X_i\}$ satisfies $\Delta_4^{[\star]} < \infty$.

Assumption 3.1 rules out time series having very strong serial dependence, for example, time series with $\Sigma^{[r,s]} = \infty$. Indeed, Assumption 3.1 implies the existence of $\Sigma$. More importantly, it leads to the invariance principle for the (scaled) partial sum $\sum_{i=1}^{\lfloor tn \rfloor} X_i/\sqrt{n}$ for $0 \le t \le 1$, which is required for deriving the variance of $\hat{\Sigma}_{p,q,n}$. Assumption 3.1 is satisfied by many important time series models, including the aforementioned linear process and the ARMA and BL models (see, e.g., Wu Citation2005; Liu and Wu Citation2010). Note also that some parallel formulations of dependence, such as the strong mixing coefficient (Rosenblatt Citation1985), have been widely adopted by researchers. However, mixing-type assumptions are sometimes difficult to verify. On the contrary, Assumption 3.1 is more easily verifiable (see Wu Citation2011).

We also need to regularize the size of the bandwidth $l$. Denote $\mathcal{J} := \{1, \ldots, J\}$ and $\mathcal{J}_u := \{j \in \mathcal{J} : \mu_{D_j - 1}^{[u]} \ne \mu_{D_j}^{[u]}\}$ for $u = 1, \ldots, d$.

Assumption 3.2

(Conditions on $l$). The bandwidth $l = l_n$ satisfies (i) $l \to \infty$ as $n \to \infty$, (ii) $l = o(n)$ as $n \to \infty$, and (iii) $\{(c_0 + c_1) \vee 1\}\, l \le \inf_{u \in \{1,\ldots,d\}} \inf_{j \in \mathcal{J}_u} (D_{j+1} - D_j)$.

In Assumption 3.2, conditions (i) and (ii) require that the size of $l$ cannot be too small or too large, respectively. These conditions are commonly required in the small-$l$ subsampling approach (i.e., $l/n \to 0$) (see Politis, Romano, and Wolf Citation1999). Condition (iii) states that two consecutive CPs cannot be too close within the same component of the time series. Indeed, condition (iii) is stronger than needed, but it makes the derivations easier.

3.3.1 Bias and Variance Expressions

Let $\chi := \mathbb{1}\{c_1 \ne 1 \text{ or } c_0 \ne F_{p,q}\}$, where $F_{p,q} := (p+2)(p+q+2)/\{(p+1)(p+q+1)\}$. Also let
$$a_n := \sup_{u \in \{1,\ldots,d\}} \sup_{j \in \mathcal{J}_u} \big| \mu_{D_j}^{[u]} - \mu_{D_j - 1}^{[u]} \big|.$$

Also recall that $J = J_n$ denotes the number of CPs (see (2)). The bias of the jump robust estimator, $\mathrm{Bias}(\hat{\Sigma}_{p,q,n}^{[r,s]}) := \mathbb{E}(\hat{\Sigma}_{p,q,n}^{[r,s]}) - \Sigma_p^{[r,s]}$, is given below.

Theorem 3.1

(Bias of the estimator). Suppose that $X_1 \in \mathcal{L}^2$, $f(x) \equiv 0$, $\mathrm{CSD}(X) = P \ge p + q$, and Assumption 3.2 holds, where $p \in \mathbb{N}_0$ and $q \in \mathbb{N}$. Then, for $c_0, c_1 \in \mathbb{R}^+$ and $r, s \in \{1, \ldots, d\}$,
$$\mathrm{Bias}(\hat{\Sigma}_{p,q,n}^{[r,s]}) = -\frac{1}{l^q}\, \Sigma_{p+q}^{[r,s]} + r_n^{\mathrm{bias}}, \qquad r_n^{\mathrm{bias}} = O\Big\{ \frac{l^{p+1}}{n} \Big( l^{\chi} + \frac{l^2}{n} \Big) a_n^2 J_n \Big\} + o\Big( \frac{1}{l^q} \Big). \qquad (12)$$

In Theorem 3.1, the assumption $\mathrm{CSD}(X) \ge p + q$ controls the rate of decay of the ACVF. For a fixed $p$, if the value of $\mathrm{CSD}(X)$ is larger, then $q$ can be taken larger, and the autocorrelation is weaker. Consequently, the autocorrelations at large lags introduce only a small bias to $\hat{\Sigma}_{p,q,n}^{[r,s]}$. Hence, it makes sense that the magnitude of the leading term of the bias in (12), that is, $|\Sigma_{p+q}^{[r,s]}|/l^q = O(1/l^q)$, is decreasing in $q$. Besides, $J_n$ and $a_n$ determine the frequency of the CPs and the magnitude of the jumps, respectively. From (12), if $a_n^2 J_n$ is not too large so that $r_n^{\mathrm{bias}} = o(1/l^q)$, the dominating term of the asymptotic bias is $\Sigma_{p+q}^{[r,s]}/l^q$ in magnitude. Consequently, $\hat{\Sigma}_{p,q,n}^{[r,s]}$ is asymptotically unbiased as $l \to \infty$. Technical conditions for controlling $r_n^{\mathrm{bias}}$ are discussed in Corollary 4.1. Moreover, $c_0$ and $c_1$ do not affect the first-order asymptotic bias of $\hat{\Sigma}_{p,q,n}$.

Define $\Xi^{[r,s]} := \Sigma^{[r,r]}\Sigma^{[s,s]} + (\Sigma^{[r,s]})^2$. The variance of $\hat{\Sigma}_{p,q,n}^{[r,s]}$ is given below.

Theorem 3.2

(Variance of the estimator). Suppose that $X_1 \in \mathcal{L}^{\nu}$ for some $\nu > 4$, $f(x) \equiv 0$, and Assumptions 3.1 and 3.2 hold. If $p \in \mathbb{N}_0$, $q \in \mathbb{N}$, $c_0, c_1 \in \mathbb{R}^+$, and $r, s \in \{1, \ldots, d\}$, then
$$\mathrm{var}(\hat{\Sigma}_{p,q,n}^{[r,s]}) = \frac{4q^2(1+c_1)\, \Xi^{[r,s]}\, l^{1+2p}}{(2p+1)(2p+q+1)(2p+2q+1)\, n} + r_n^{\mathrm{var}}, \qquad r_n^{\mathrm{var}} = O\Big[ \frac{l^{1+2p}}{n} \Big\{ \frac{l^2}{n}(1 + a_n^2 J_n) + o(1) \Big\} \Big]. \qquad (13)$$

Theorem 3.2 requires Assumption 3.1 because its proof relies on the invariance principle, which is guaranteed by Assumption 3.1 (see, e.g., Wu Citation2005). However, the detailed strength of serial dependence (i.e., the CSD) is not important for deriving (13). Besides, unlike the asymptotic bias, the variance of $\hat{\Sigma}_{p,q,n}$ also depends on $L_k$; however, only $c_1$, not $c_0$, is relevant. Since $c_1$ determines the speed of divergence of $L_k$ as $|k| \to \infty$, the variance in (13) is naturally increasing in $c_1$. Note that $c_0$ and $c_1$ are not tuning parameters for balancing the leading terms of the bias and variance because they are not involved in the leading term of (12). Although $c_0$ and $c_1$ can be chosen optimally by balancing the second-order bias and variance, that is, $r_n^{\mathrm{bias}}$ and $r_n^{\mathrm{var}}$, the effect on $\hat{\Sigma}_{p,q,n}^{[r,s]}$ is relatively incremental.

Consider $l = O(n^{\theta})$ for some $\theta \in (0, 1)$. The MSE-optimal value of $\theta$ can be found by balancing the squared bias and the variance of $\hat{\Sigma}_{p,q,n}^{[r,s]}$ so that the MSE is minimized. Assume $a_n^2 J_n$ grows sufficiently slowly so that $r_n^{\mathrm{bias}} = o(1/l^q)$ and $r_n^{\mathrm{var}} = o(l^{1+2p}/n)$ (see Corollary 4.1 for explicit conditions that guarantee this). In this case, Theorems 3.1 and 3.2 imply that
$$\mathrm{MSE}(\hat{\Sigma}_{p,q,n}^{[r,s]}) = \{\mathrm{Bias}(\hat{\Sigma}_{p,q,n}^{[r,s]})\}^2 + \mathrm{var}(\hat{\Sigma}_{p,q,n}^{[r,s]}) = O(1/l^{2q}) + O(l^{1+2p}/n). \qquad (14)$$

If $l = O(n^{\theta^{\star}})$, then (14) achieves its minimum order, that is, $\mathrm{MSE}(\hat{\Sigma}_{p,q,n}^{[r,s]}) = O(n^{-\lambda^{\star}})$, where
$$\theta^{\star} := 1/(1 + 2p + 2q) \quad\text{and}\quad \lambda^{\star} := 2q/(1 + 2p + 2q). \qquad (15)$$

Note that the superscript “$\star$” indicates optimal values. It is worth mentioning that the robust estimator $\hat{\Sigma}_{p,q,n}$ achieves the same optimal $\mathcal{L}^2$ convergence rate as its non-robust counterparts (see, e.g., Andrews Citation1991; Chan and Yau Citation2017b).
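The rate calculation in (14)–(15) can be verified mechanically: with $l \asymp n^{\theta}$, the squared bias is of order $n^{-2q\theta}$ and the variance of order $n^{(1+2p)\theta - 1}$, and $\theta^{\star}$ equates the two exponents. A small sketch (our own check) using exact rational arithmetic:

```python
from fractions import Fraction

def optimal_rate(p, q):
    """Balance the exponents of squared bias and variance, cf. (14)-(15)."""
    theta = Fraction(1, 1 + 2 * p + 2 * q)   # theta* from (15)
    bias_sq_exp = -2 * q * theta             # order of 1/l^{2q} in powers of n
    var_exp = (1 + 2 * p) * theta - 1        # order of l^{1+2p}/n in powers of n
    assert bias_sq_exp == var_exp            # exponents balance exactly at theta*
    return theta, -bias_sq_exp               # (theta*, lambda*)

theta, lam = optimal_rate(p=0, q=2)
# For (p, q) = (0, 2): theta* = 1/5 and lambda* = 4/5, matching (15).
```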

3.3.2 Theoretically Optimal Bandwidth

In this subsection, we derive the optimal $l \sim \phi n^{\theta^{\star}}$, $\phi \in \mathbb{R}^+$, such that the MSE of $\hat{\Sigma}_{p,q,n}$ is optimized up to the first order, including its proportionality constant.

Suppose $r_n^{\mathrm{bias}} = o(1/l^q)$ and $r_n^{\mathrm{var}} = o(l^{1+2p}/n)$. Let $W$ be a weight matrix specifying the entry-wise importance of $\Sigma_p$. For example, $W = (\mathbb{1}\{r \le s\})_{r,s=1}^d$ puts equal weight on each element of the upper triangular part (including the diagonal) of $\Sigma_p$. Write $W \ge 0$ if $W^{[r,s]} \ge 0$ for all $r, s$, and $W^{[r,s]} > 0$ for at least one pair of $r, s$. From now on, assume $W \ge 0$. Denote $\mathcal{W} := \mathrm{diag}\{\mathrm{vec}(W)\}$ (see Section 2.2 for the definitions of $\mathrm{diag}(\cdot)$ and $\mathrm{vec}(\cdot)$). Then the optimal value of $\phi$ is the minimizer of
$$\mathrm{AMSE}_{p,q,W}(\hat{\Sigma}_{p,q,\cdot}) := \lim_{n\to\infty} n^{2q/(1+2p+2q)}\, \mathrm{MSE}_W(\hat{\Sigma}_{p,q,n}), \quad\text{where}\quad \mathrm{MSE}_W(\hat{\Sigma}_{p,q,n}) := \mathbb{E}\big[ \{\mathrm{vec}(\hat{\Sigma}_{p,q,n} - \Sigma_p)\}^{\top}\, \mathcal{W}\, \{\mathrm{vec}(\hat{\Sigma}_{p,q,n} - \Sigma_p)\} \big]. \qquad (16)$$

The weighted MSE in (16) generalizes the $\mathcal{L}^2$ risk under the Frobenius norm $\|\cdot\|_F$ because, for any square matrix $A$, $\{\mathrm{vec}(A)\}^{\top}\mathcal{W}\{\mathrm{vec}(A)\} = \mathrm{tr}(A A^{\top}) = \|A\|_F^2$ if $W = \mathbf{1}\mathbf{1}^{\top}$. A similar weighting rule is adopted by Andrews (Citation1991) and Chan and Yau (Citation2017a). By Theorems 3.1 and 3.2, the optimal value of $\phi$ is $\phi^{\star}$, where
$$\phi^{\star} := \phi_{p,q}^{\star} := \Big\{ \frac{(2p+q+1)(2p+2q+1)\, \kappa_{p+q}}{2q(1+c_1)} \Big\}^{\theta^{\star}}, \qquad (17)$$
$$\kappa_{p+q} := \frac{(\mathrm{vec}\,\Sigma_{p+q})^{\top}\, \mathcal{W}\, (\mathrm{vec}\,\Sigma_{p+q})}{(\mathrm{vec}\,\Sigma)^{\top}\, \mathcal{W}\, (\mathrm{vec}\,\Sigma) + \mathrm{tr}\{\mathcal{W}(\Sigma \otimes \Sigma)\}}. \qquad (18)$$

Here $A \otimes B$ is the Kronecker product of $A$ and $B$. Note that the size of the optimal bandwidth $l \sim \phi^{\star} n^{\theta^{\star}}$ depends on the two parameters $\theta^{\star}$ and $\phi^{\star}$.

  • The parameter $\theta^{\star} = 1/(1 + 2P)$, with $P = p + q$, controls the divergence rate of $l = O(n^{\theta^{\star}})$. Recall from Table 2 that if $P$ is small, then the serial dependence is strong. Hence, it makes sense to have a larger optimal bandwidth $l$ to cover more autocovariances.

  • The parameter $\phi^{\star}$ controls the leading coefficient of $l$. The value of $\phi^{\star}$ depends on the unknown $\kappa_{p+q}$. We interpret this quantity for univariate $\{X_i\}$. In this case, $\kappa_{p+q} = (\Sigma_{p+q}/\Sigma)^2/2$, which is not purely increasing in the strength of autocorrelation. Indeed, it also depends on the sign of the autocorrelation. For example, if $\Gamma_k = \rho^{|k|}$ for some $\rho \in (-1, 1)$, then $\kappa_2 = 2\rho^2/(1-\rho)^4$, which is not an increasing function of $|\rho|$. This interesting phenomenon also exists in standard variance estimation (see Andrews Citation1991, (5.1) and (5.2)). Estimation of $\phi^{\star}$ is presented in Section 4.3.
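The closed form $\kappa_2 = 2\rho^2/(1-\rho)^4$ for the ACVF $\Gamma_k = \rho^{|k|}$ can be checked by truncating the series $\Sigma = \sum_k \Gamma_k$ and $\Sigma_2 = \sum_k k^2 \Gamma_k$ (our own numerical sketch; the truncation point $K$ is an arbitrary choice):

```python
def kappa2(rho, K=2000):
    """kappa_2 = (Sigma_2 / Sigma)^2 / 2 for the univariate ACVF Gamma_k = rho^|k|."""
    sigma = sum(rho ** abs(k) for k in range(-K, K + 1))            # Sigma, truncated
    sigma2 = sum(k * k * rho ** abs(k) for k in range(-K, K + 1))   # Sigma_2, truncated
    return (sigma2 / sigma) ** 2 / 2

# Closed form from the text: 2 * rho**2 / (1 - rho)**4; e.g., rho = 0.5 gives 8.
```

Comparing the two for positive and negative $\rho$ illustrates the sign dependence noted above.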

Formula (17) handles all entries of $\Sigma$ simultaneously. If the dependence structures of $\{X_i\}$ vary dramatically across entries, we may construct an entry-adaptive optimal bandwidth. Let $e_u := (0, \ldots, 0, 1, 0, \ldots, 0)^{\top}$ be the $u$th elementary $d$-vector, that is, $e_u^{[v]} = \mathbb{1}\{v = u\}$ for all $v \in \{1, \ldots, d\}$. Setting $W = e_r e_s^{\top}$, we can produce the optimal bandwidth for the $(r, s)$th entry of $\Sigma$. The resulting optimal asymptotic MSE (AMSE) of $\hat{\Sigma}_{p,q,n}^{[r,s]}$ is given by
$$n^{\lambda_{p,q}^{\star}}\, \mathbb{E}\big( \hat{\Sigma}_{p,q,n}^{[r,s]} - \Sigma_p^{[r,s]} \big)^2 \to \frac{1}{1-\lambda^{\star}} \Big\{ \frac{\lambda^{\star}(1+c_1)}{2p+q+1} \Big\}^{\lambda^{\star}} \Xi^{[r,s]} \Big\{ \frac{(\Sigma_{p+q}^{[r,s]})^2}{\Xi^{[r,s]}} \Big\}^{1-\lambda^{\star}}. \qquad (19)$$

4 Extension, Discussion, and Implementation

4.1 Extension to Trend Robustness

In this section, we consider the full generalization of $\{Y_i\}$; that is, the assumption $f(x) \equiv 0$ is removed. We measure the amount of fluctuation of $f = f_n$ by
$$b_n := \sup_{u \in \{1,\ldots,d\}} \sup_{x \ne x' \in [0,1]} \Big| \frac{f_n^{[u]}(x) - f_n^{[u]}(x')}{x - x'} \Big|,$$
which is small if the fluctuation of $f_n$ does not grow too fast with $n$. The following two theorems state the bias and variance of $\hat{\Sigma}_{p,q,n}$ when $\mu(\cdot)$ consists of jumps and trends.

Theorem 4.1

(Bias of the estimator). If the assumption $f(x) \equiv 0$ is removed, then, under all other conditions of Theorem 3.1, (12) holds with $r_n^{\mathrm{bias}}$ replaced by $R_n^{\mathrm{bias}} = r_n^{\mathrm{bias}} + O\{l^{p+3}(b_n^2 + J_n a_n b_n)/n^2\}$.

Theorem 4.2

(Variance of the estimator). If the assumption $f(x) \equiv 0$ is removed, then, under all other conditions of Theorem 3.2, (13) holds with $r_n^{\mathrm{var}}$ replaced by $R_n^{\mathrm{var}} = r_n^{\mathrm{var}} + O(l^{4+2p} b_n^2/n^3)$.

The optimal bandwidth in (17) and the optimal MSE in (19) remain valid, provided that $R_n^{\mathrm{bias}} = o(1/l^q)$ and $R_n^{\mathrm{var}} = o(l^{1+2p}/n)$. Consequently, $\hat{\Sigma}_{p,q,n}$ achieves the optimal convergence rate even in the presence of jumps and continuous trends. The remainder terms $R_n^{\mathrm{bias}}$ and $R_n^{\mathrm{var}}$ are influenced by (i) the jump effect $a_n^2 J_n$, (ii) the trend effect $b_n^2$, and (iii) their joint effect $a_n b_n J_n$. Using these three factors, we define the following classes of mean functions:
$$\mathcal{M} := \{\mu(\cdot) : a_n^2 J_n = o(n^{\theta^{\star}(p+q-\chi)}),\; a_n b_n J_n + b_n^2 = o(n^{\theta^{\star}(3p+3q-1)})\},$$
$$\mathcal{M}' := \{\mu(\cdot) : a_n^2 J_n = o(n^{\theta^{\star}(p+2q-\chi)}),\; a_n b_n J_n + b_n^2 = o(n^{\theta^{\star}(3p+4q-1)})\}.$$

Both M′ and M include only reasonably well-behaved mean functions μ(·) for which the aforementioned effects (i), (ii), and (iii) are small. Clearly, M′ ⊆ M. Simple conditions for controlling R_n^bias and R_n^var are given below.

Corollary 4.1

Assume the conditions in Theorems 4.1 and 4.2. Let l = O(n^θ).

  1. If μ(·) ∈ M′, then R_n^bias = o(1/l^q) and R_n^var = o(l^{1+2p}/n).

  2. If μ(·) ∈ M, then R_n^bias = o(1) and R_n^var = o(1).

The above results remain valid if Rnbias and Rnvar are replaced by rnbias and rnvar, respectively.

Corollary 4.1 ensures that Σ̂p,q,n is L2-consistent if μ(·) belongs to the well-behaved class M. If μ(·) belongs to the more well-behaved class M′, we also have the optimality results (17) and (19), which imply that the convergence rate of the estimator Σ̂p,q,n is not affected by jumps and trends. However, the standard (non-robust) estimators, for example, Σ̂OBM,n in (3) and Σ̂QS,n in (4), are not guaranteed to be consistent if μ(·) ∈ M.

For example, consider the estimator Σ̂0,2,n with l = O(n^{1/5}), and any c0 > 0 and c1 ≥ 1. It is L2-consistent and satisfies the optimality results (17) and (19) if μ(·) belongs to(20) M′ = {μ(·) : a_n²J_n = o(n^{1/5}), a_n b_n J_n + b_n² = o(n)}.(20)

The class M′ in (20) includes (but is not restricted to) mean functions having piecewise Lipschitz-continuous trends with at most J_n = o(n^{1/5}) bounded jumps. Note that such J_n is allowed to diverge to infinity as n → ∞ (see the last row of the comparison table for a summary and a comparison with existing robust estimators). We illustrate Corollary 4.1 through a simple simulation experiment. Let Xi = 0.5X_{i−1} + 0.5ε_{i−1} + εi, where εi ~iid N(0,1). Consider(21) μ(t) = 4 × 1{0.2 ≤ t < 0.3} + 2e^{2t} + sin(8πt),(21) which consists of two CPs, an exponentially increasing trend, and a periodic structure. In this case, a_n = 4, b_n = 4(e² + 2π), and J_n = 2. Hence, the mean function (21) is a member of the class M′ defined in (20). Figure 1(a) shows a typical realization of Yi = Xi + μ_i (1 ≤ i ≤ 400), and Figure 1(b) shows the density functions of Σ̂0,2,n and Σ̂QS,n. The proposed estimator Σ̂0,2,n concentrates around the true value Σ = 9, whereas the standard estimator Σ̂QS,n is clearly off the targeted value.
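The stated target value Σ = 9 can be verified directly: for an ARMA(1,1) process with unit-variance innovations, the long-run variance equals (1 + b)²/(1 − a)², a standard fact (the squared transfer function at frequency zero). A quick check for the model above with a = b = 0.5:

```python
# Long-run variance (AVC) of the ARMA(1,1) noise X_i = a*X_{i-1} + b*eps_{i-1} + eps_i
# with Var(eps_i) = sigma_eps2: Sigma = sigma_eps2 * (1 + b)^2 / (1 - a)^2.
def arma11_long_run_variance(a, b, sigma_eps2=1.0):
    return sigma_eps2 * (1 + b) ** 2 / (1 - a) ** 2

print(arma11_long_run_variance(0.5, 0.5))  # 9.0, matching the true value Sigma = 9
```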

Fig. 1 (a) A typical realization of the time series with the mean function defined in (21). (b) The density functions of Σ̂0,2,n and Σ̂QS,n when n = 400. The true value is Σ = 9.


4.2 Comparison With Standard Estimators

The estimator Σ̂p,q,n sacrifices statistical efficiency to gain robustness. In this section, we investigate how much efficiency is lost.

The proposed robust estimator Σ̂0,1,n uses the Bartlett kernel K1(·) ≡ Bart(·). So, we compare it with the standard non-robust Bartlett kernel estimator Σ̂Bart,n defined in (4). Denote the optimal bandwidths for Σ̂0,1,n and Σ̂Bart,n by l*0,1 and l*Bart, respectively. According to (15) and (17), and Equation (5.2) of Andrews (1991), they are given by l*0,1 ≍ {3κ1 n/(2(1+c1))}^{1/3} and l*Bart ≍ (3κ1 n/2)^{1/3}, respectively, where κ1 is defined in (18). Denote the resulting optimal estimators by Σ̂*0,1,n and Σ̂*Bart,n, respectively. The ratio of their weighted MSEs (see (16)) is given below.

Proposition 4.1

Assume the conditions in Theorems 3.1 and 3.2. Let c0, c1 > 0, W ≠ 0, and μ(t) = 0 for all t ∈ [0,1]. Then MSE_W(Σ̂*0,1,n)/MSE_W(Σ̂*Bart,n) → (1+c1)^{2/3} > 1.

According to Proposition 4.1, the non-robust estimator Σ̂*Bart,n is asymptotically more efficient than the robust estimator Σ̂*0,1,n, which is to be expected. Note that the efficiency loss is smaller if c1 is smaller. However, in finite samples, setting c1 ≈ 0 may degenerate the estimator to the naive estimator Σ̂naive,n defined in (7). Hence, using a small c1 > 0 is suggested only if the sample size n is extremely large. Practical suggestions for selecting c1 are discussed in Section 4.3.

Besides, we also compare our estimator with the most promising (univariate) robust estimator, proposed by Wu, Woodroofe, and Mentz (2001), Wu (2004), and Wu and Zhao (2007), namely,(22) σ̂²WZ3,n := l{2(m−1)}^{−1} Σ_{k=2}^{m} (A_k − A_{k−1})²,(22) where m = ⌊n/l⌋, and A_k = l^{−1} Σ_{i=(k−1)l+1}^{kl} Y_i is the kth non-overlapping batch mean (NBM) for k = 1,…,m. The optimal MSE of σ̂²WZ3,n was not derived by the authors; for reference, we derive it under the constant-mean assumption. Applying techniques similar to those in Theorems 3.1 and 3.2, we have Bias(σ̂²WZ3,n) ≍ Σ1/l and var(σ̂²WZ3,n) ≈ 7Σ²l/(2n). The optimal bandwidth is l* ≍ {4Σ1²n/(7Σ²)}^{1/3}. Consequently, MSE(Σ̂*0,1,n)/MSE(σ̂*²WZ3,n) → {8(1+c1)/21}^{2/3}. In particular, when c1 = 1, our estimator Σ̂*0,1,n is uniformly better than σ̂*²WZ3,n, and MSE(Σ̂*Bart,n) : MSE(Σ̂*0,1,n) : MSE(σ̂*²WZ3,n) ≈ 1.00 : 1.59 : 1.90 when n is large and their respective optimal bandwidths are used.
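The closing ratios 1.00 : 1.59 : 1.90 follow from the two displayed MSE ratios; a quick numeric check with c1 = 1:

```python
# Reproduce the limiting MSE ratios of Section 4.2 with c1 = 1:
# MSE(Bartlett*) : MSE(Sigma*_{0,1,n}) : MSE(sigma*_{WZ3,n}^2) ~ 1.00 : 1.59 : 1.90.
c1 = 1
r_mac_vs_bart = (1 + c1) ** (2 / 3)            # from Proposition 4.1
r_mac_vs_wz3 = (8 * (1 + c1) / 21) ** (2 / 3)  # MSE(MAC*) / MSE(WZ3*), derived above
r_wz3_vs_bart = r_mac_vs_bart / r_mac_vs_wz3   # implied MSE(WZ3*) / MSE(Bartlett*)
print(round(r_mac_vs_bart, 2), round(r_wz3_vs_bart, 2))  # 1.59 1.9
```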

4.3 Choices of q, c0, c1, and l

The best estimator in Wu and Zhao (2007) has an MSE of size O(n^{−2/3}), whereas our proposed estimator Σ̂0,q,n has a much smaller MSE, namely O{n^{−2q/(1+2q)}}, if q > 1. In practice, if there is no prior information, we suggest q = 2, that is, assuming CSD(X) = 2, which is essentially equivalent to the assumption ϒ2 < ∞ made by Paparoditis and Politis (2001).

Although we develop the theory for all c1 > 0, it makes little sense, statistically or intuitively, to use c1 ∈ (0,1). To see this, observe that Π̂k(L_k) = Ψ̂_{L_k} − Ψ̂_{|k|} is a reasonable estimator of Πk only if L_k > |k|, which is satisfied for all k if and only if c1 ≥ 1. Hence, it is sensible (but not necessary) to assume c1 ∈ [1,∞), among which c1 = 1 minimizes the AMSE. So, c1 = 1 is suggested in practice. For q > 1, Σ̂p,q,n has the same AMSE for any c0 > 0; hence, c0 does not affect the asymptotic behavior. We illustrate in Section C.4 of the supplementary materials that the finite-sample performance of Σ̂p,q,n is essentially the same for any c0 that is not close to zero. In practice, we suggest c0 = 1 as the default choice.

If an initial pilot estimate of Σp is needed, we can use Σ̂p,q,n with a rate-optimal bandwidth l = O(n^θ). In practice, we suggest l = [[2n^θ]], where [[t]] := (2 ∨ ⌈t⌉) ∧ (n−1). According to our simulation experience, this rule-of-thumb bandwidth gives reasonably good performance. Using the notation in (10), we denote the resulting pilot estimator by(23) Σ̂†p,q,n := Σ̂p,q,n(Y1:n, l = [[2n^{1/(1+2p+2q)}]], c0 = 1, c1 = 1).(23)

In particular, for estimating Σ ≡ Σ0, our recommended default estimator is as simple as(24) Σ̂†0,2,n = Σ_{k=−l}^{l} (1 − |k/l|²)(Ψ̂_{l+|k|} − Ψ̂_{|k|}),(24) where l = [[2n^{1/5}]] and Ψ̂_h = {2(n−|h|+1)}^{−1} Σ_{i=|h|+1}^{n} (Yi − Y_{i−|h|})². If a more accurate estimate of Σp is needed, we can use Σ̂p,q,n with the fully optimal bandwidth l* ≍ ϕn^θ. From (17), ϕ is a function of Σ and Σ_{p+q}, so the value of ϕ is unknown. We propose to first estimate Σ and Σ_{p+q} by the pilot estimators Σ̂†0,2,n and Σ̂†p+q,2,n. Then ϕ is consistently estimated by plugging these estimated values into (17) and (18), that is,(25) ϕ̂ := [(2p+q+1)(2p+2q+1)(vec Σ̂†p+q,2,n)⊤ W (vec Σ̂†p+q,2,n) / (2q(1+c1){(vec Σ̂†0,2,n)⊤ W (vec Σ̂†0,2,n) + tr{W(Σ̂†0,2,n ⊗ Σ̂†0,2,n)}})]^{1/(1+2p+2q)}.(25)

Using l̂ := [[ϕ̂n^θ]], the estimator Σ̂p,q,n is equipped with the asymptotically optimal bandwidth. The resulting estimator(26) Σ̂‡p,q,n := Σ̂p,q,n(Y1:n, l = [[ϕ̂n^{1/(1+2p+2q)}]], c0 = 1, c1 = 1)(26) is called the qth-order MAC estimator of Σp. It can be computed by Algorithm 1. The R package MAC implements it.
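As an illustration, a univariate sketch of the default estimator (24) might look as follows. The ceiling-based bandwidth rule is a simplified stand-in for the integer-truncation operator [[·]], and the jump experiment at the end only illustrates the robustness mechanism: since only differences Y_i − Y_{i−h} enter Ψ̂_h, a level shift perturbs the estimate through just the few straddling terms.

```python
import math
import random

def psi_hat(y, h):
    # Psi_hat_h = sum_{i=h+1}^{n} (Y_i - Y_{i-h})^2 / {2(n - h + 1)}; Psi_hat_0 = 0.
    n = len(y)
    if h == 0:
        return 0.0
    return sum((y[i] - y[i - h]) ** 2 for i in range(h, n)) / (2 * (n - h + 1))

def mac_default(y, l=None):
    # Univariate sketch of (24) with the q = 2 polynomial kernel weights (1 - |k/l|^2).
    n = len(y)
    if l is None:
        l = math.ceil(2 * n ** 0.2)  # rule-of-thumb bandwidth ~ 2 n^{1/5}
    return sum((1 - (abs(k) / l) ** 2) * (psi_hat(y, l + abs(k)) - psi_hat(y, abs(k)))
               for k in range(-l, l + 1))

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(20000)]  # iid noise, true sigma^2 = 1
y = [xi + (5.0 if i >= 10000 else 0.0) for i, xi in enumerate(x)]  # add one big jump
print(round(mac_default(x), 2), round(mac_default(y), 2))  # both close to 1
```

Note that no CP search or trend estimation is involved: the difference-based Ψ̂_h absorbs the mean structure automatically.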

Algorithm 1: Proposed MAC estimator Σ̂p,q,n for estimating Σp

[1] Input:

[2] (i) Y1:n, the d-dimensional time series;

[3] (ii) p, the order of the estimand Σp (set p = 0 for estimation of the ACM Σ);

[4] (iii) q, the order of the polynomial kernel Kq(·) (set q = 2 by default);

[5] (iv) c0, c1, the parameters (set c0 = c1 = 1 by default); and

[6] (v) W, the d × d weight matrix (set W[r,s] = 1{r ≤ s} for each 1 ≤ r, s ≤ d by default).

[7] begin

[8] Compute Σ̂†0,2,n and Σ̂†p+q,2,n according to (23);

[9] Compute ϕ̂ according to (25);

[10] Compute the estimated optimal bandwidth l̂ = [[ϕ̂n^{1/(1+2p+2q)}]];

[11] Compute Σ̂‡p,q,n = Σ̂p,q,n(Y1:n, l = l̂, c0, c1) according to (10).

return Σ̂‡p,q,n, the MAC estimator of Σp.

4.4 Discussion on Robustness to Heteroscedasticity

Thus far, we have assumed that the noise sequence {Xi} is stationary (i.e., without heteroscedasticity). Now, suppose that {Xi} is not stationary but satisfies E(Xi) = 0. In this case, we define the finite-n version of Σ0 = lim_{n→∞} n var(Y̅n) by(27) Σ0,n := n var(Y̅n) = n E(X̅n X̅n⊤) = n^{−1} Σ_{1≤i,j≤n} E(X_i X_j⊤) = Σ_{|k|<n} Π_{k,n},(27) where Π_{k,n} = (2n)^{−1} Σ_{i=1+|k|}^{n} E(X_i X_{i−|k|}⊤ + X_{i−|k|} X_i⊤). Following the arguments in Section 3.1, it is not hard to see that Π̂k(L) still approximates Π_{k,n}. Thus, it is not surprising that the proposed estimator Σ̂p,q,n continues to be consistent for Σ_{p,n} := Σ_{|k|<n} |k|^p Π_{k,n}. Similar to Section 8 of Andrews (1991), we can extend the consistency results to heteroscedastic time series. Suppose the regularity conditions of Theorem 4.1, Theorem 4.2, and part (1) of Corollary 4.1 are satisfied except for the following changes.

  • The stationarity of the noise sequence {Xi} is removed. However, we still require that there is some ν > 4 such that Xi ∈ L^ν and E(Xi) = 0 for all i ∈ Z.

  • The assumption CSD(X) = p+q is changed to CSD*(X) = p+q, where CSD*(X) := sup{P ∈ N : max_{r,s∈{1,…,d}} Σ_{k=−∞}^{∞} |k|^P sup_{i≥1} |E(Xi[r] X_{i−k}[s])| < ∞}.

We also define, for each P ∈ N0, Σ_{P,*}[r,s] := Σ_{k=−∞}^{∞} |k|^P sup_{i≥1} |E(Xi[r] X_{i−k}[s])| and Ξ*[r,s] := Σ_{0,*}[r,r] Σ_{0,*}[s,s] + (Σ_{0,*}[r,s])².

Note that CSD*(X) = P implies that Σ_{P,*}[r,s] < ∞ and Ξ*[r,s] < ∞ for each r, s. Under the modified regularity conditions, the conclusions of Theorems 4.1 and 4.2 are updated to(28) limsup_{n→∞} l^{2q} {E(Σ̂p,q,n[r,s]) − Σ_{p,n}[r,s]}² ≤ (Σ_{p+q,*}[r,s])²,(28) (29) limsup_{n→∞} n l^{−(1+2p)} var(Σ̂p,q,n[r,s]) ≤ 4q²(1+c1) Ξ*[r,s] / {(2p+1)(2p+q+1)(2p+2q+1)}(29)

for all r, s ∈ {1,…,d}. If l = O(n^{1/(1+2p+2q)}), then (28) and (29) imply that limsup_{n→∞} n^{2q/(1+2p+2q)} E(Σ̂p,q,n[r,s] − Σ_{p,n}[r,s])² ≤ C for some C < ∞. Hence, Σ̂p,q,n is a consistent estimator of Σ_{p,n} with the optimal convergence rate. Examples and the finite-sample performance of Σ̂p,q,n in the heteroscedastic case are shown in Section 5.3.

5 Finite Sample Performance

5.1 Efficiency and Robustness Against One Jump

We compare Σ̂0,q,n with the following estimators in terms of efficiency and robustness.

  • (CV) Crainiceanu and Vogelsang (2007) proposed to estimate one potential CP D1 and then construct a de-trended process, say {X̂iCV}. The modified OBM estimator σ̂²CV,n is defined by applying the estimator (3) to {X̂iCV} instead of {X̂i}. Andrews's (1991) AR(1) plug-in rule is used to select the optimal batch size.

  • (WZ) Wu and Zhao (2007) used the NBMs {A_k} to estimate σ². They proposed σ̂²WZ1,n := πl{4(m−1)²}^{−1}(Σ_{k=2}^{m} |A_k − A_{k−1}|)² and σ̂²WZ2,n := l{2z²_{3/4}}^{−1}(median_{k∈{2,…,m}} |A_k − A_{k−1}|)², together with σ̂²WZ3,n defined in (22), where median_{k∈K} x_k denotes the median of {x_k}_{k∈K}, and z_p is the 100p% quantile of N(0,1). They showed, under regularity conditions, that σ̂²WZ1,n and σ̂²WZ2,n are weakly consistent if l = [[n^{5/8}]], and that σ̂²WZ3,n is L2-consistent with MSE(σ̂²WZ3,n) = O(n^{−2/3}) if l ≍ n^{1/3}. For σ̂²WZ3,n, we implement it with an estimated optimal bandwidth obtained via our proposed method (see Section 4.2 for more details). Denote these three estimators by WZ1, WZ2, and WZ3, respectively.

  • (AC) Altissimo and Corradi (2003) proposed using the Bartlett kernel estimator after locally detrending the mean. The bandwidth is selected by cross-validation. Denote the resulting estimator by σ̂²AC,n. They proved that σ̂²AC,n is consistent.

  • (MAC) We use the estimators Σ̂‡0,1,n and Σ̂‡0,2,n, as well as the pilot estimator Σ̂†0,2,n. Denote them by MAC(1), MAC(2), and MAC(P), respectively.

Their detailed formulas are presented in Section C.1 of the supplementary materials for reference. Recall that σ̂²CV,n is robust to one CP without trend; σ̂²WZ1,n, σ̂²WZ2,n, and σ̂²WZ3,n are proved to be robust to trends only; σ̂²AC,n is only proved to be robust to finitely many CPs; and the proposed estimators Σ̂‡0,1,n and Σ̂‡0,2,n are robust to both trends and a divergent number of CPs. If there is at most one CP, then σ̂²CV,n serves as an oracle estimator because, in practice, we rarely know in advance that there is at most one CP.

Consider the ARMA(1,1) model Yi = Xi + μi, where Xi = aX_{i−1} + bε_{i−1} + εi and εi ~iid N(0,1), for i = 1,…,n. In particular, consider a = b = 0.2, 0.4, 0.6 (Models A1–A3, respectively), n = 400 × 4^j for j = 0,…,3, and five different mean sequences μi = ξ × 1{i ≥ n/2} for ξ = 0,…,4. The MSEs are estimated using 2000 independent replications. The lack of efficiency (MSE0) is measured by the MSE when ξ = 0, whereas the lack of robustness is measured by the standard deviation (MSEsd) of the MSEs across ξ ∈ {0, 1,…,4}. A smaller MSE0 and a smaller MSEsd imply higher efficiency and robustness, respectively.
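The two summary measures can be sketched as follows; the replication numbers below are hypothetical and serve only to show how MSE0 and MSEsd are computed from the Monte Carlo output:

```python
import statistics

def mse(estimates, truth):
    # Monte Carlo MSE of a list of replicated estimates around the true value.
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

def summary_measures(estimates_by_xi, truth):
    # MSE0 = MSE at jump size xi = 0 (efficiency);
    # MSEsd = standard deviation of the MSEs across xi (robustness).
    mses = [mse(est, truth) for est in estimates_by_xi]
    return mses[0], statistics.stdev(mses)

# Hypothetical replications of an estimator of sigma^2 = 2 at xi = 0,...,4.
reps = [[2.1, 1.9, 2.0], [2.2, 1.8, 2.1], [2.3, 1.7, 2.0], [2.4, 1.6, 2.2], [2.5, 1.5, 2.3]]
mse0, mse_sd = summary_measures(reps, 2.0)
print(mse0, mse_sd)
```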

The results are shown in Figure 2. Clearly, σ̂²WZ1,n and σ̂²WZ2,n perform badly in terms of both efficiency and robustness. The major competitor σ̂²WZ3,n performs reasonably well on both measures; however, it is less efficient than all of our proposed estimators (Σ̂‡0,1,n, Σ̂‡0,2,n, Σ̂†0,2,n) in nearly all cases. The estimator σ̂²AC,n is quite efficient when the mean is constant; however, it loses its efficiency when the jump size is large. For example, when n = 400, its MSE inflates by 407% when the jump magnitude ξ increases from 0 to 4. Besides, the cross-validation step makes it computationally inefficient.

Fig. 2 The values of log(MSE0) and log(MSEsd) are plotted against n, where MSE0 denotes the MSE when the jump size ξ = 0, and MSEsd denotes the standard deviation of the MSEs across different ξ. Recall that smaller MSE0 and smaller MSEsd imply higher efficiency and robustness, respectively. Note that σ̂²AC,n is computed only when n ≤ 400 because it requires a computationally intensive cross-validation step. The horizontal axis is plotted on the logarithmic scale for better visualization.


The proposed estimators Σ̂‡0,1,n and Σ̂‡0,2,n perform the best in nearly all cases. The advantage of Σ̂‡0,2,n becomes more pronounced as n increases. The pilot estimator Σ̂†0,2,n also performs quite well, so it is justifiable to use it as an initial guess. It is remarked that Σ̂†0,2,n performs very well in Model A3 because its default tuning parameter accidentally matches the theoretically optimal value. However, this advantage is not general (see, e.g., the experiment in Section 5.3).

5.2 Robustness Against Trend and Multiple Jumps

In this subsection, we investigate the robustness against both trends and jumps. Consider the same models for {Xi} as in Section 5.1, but with the mean function replaced by μi = (i/n) × 1{0.4 ≤ i/n < 0.7} + (5i/n − 4)² × 1{i/n ≥ 0.7}. Figure 3 shows a typical realization of {Yi} in Model A2. Observe that the trend effect and the jump effect are not obvious because they are masked by the intrinsic variability of the noise {Xi}. This scenario mimics the situation in which the observed time series looks stationary but has, in fact, been contaminated by a hardly noticeable nonconstant trend and structural breaks.

Fig. 3 Thin solid line: A realization of {Yi} in Model A2 of length n = 400. Thick solid line: The nonconstant mean function {μi} in Section 5.2. Dotted vertical lines: The change points.


The simulation results are visualized in Figure 4. First, the MSE of the previously oracle-like estimator σ̂²CV,n does not decrease with n because it is no longer consistent when the mean is not a piecewise constant function. The estimators σ̂²WZ1,n and σ̂²WZ2,n again perform poorly. The estimator σ̂²WZ3,n and our proposed Σ̂‡0,1,n, Σ̂‡0,2,n, and Σ̂†0,2,n perform well; among them, σ̂²WZ3,n performs least well, whereas Σ̂‡0,2,n and Σ̂†0,2,n perform most promisingly. The take-home message is that even if the trend is relatively insignificant, its impact on estimators of σ² can be catastrophic, especially when the mean structure is misspecified.

Fig. 4 The values of log{MSE(·)} of different estimators are plotted against the sample size n in Models A1–A3. Here the mean function consists of nonconstant trends and multiple jumps (see Section 5.2 and Figure 3). Note that the horizontal axis is plotted on the logarithmic scale.


5.3 Multivariate Time Series With Heteroscedastic Errors

We consider estimation of Σ0,n (defined in (27)) for a bivariate time series {Yi = (Yi1, Yi2)⊤}_{i=1}^{n} with time-varying means and heteroscedastic errors. Let Yij = μij + τij Xij for i = 1,…,n and j = 1, 2, where μij is the mean, Xij is a stationary noise, and τij creates heteroscedasticity. Two mean sequences are used: (i) μij = 0 for all i, j; and (ii) μi1 = i/n, μi2 = 1{i/n > 1/3}. We set τi1 = 1 + i/(4n), τi2 = 1 + sin(4πi/n)/(4n), and generate {Xij} as follows: (Xi1, Xi2)⊤ = [0.27, 0.09; 0.18, 0.18](X_{i−1,1}, X_{i−1,2})⊤ + [0.01, 0.14; 0.28, 0.08]ε_{i−1} + εi, for i = 1, 2,…,n, where ε0,…,εn are independent standard bivariate normal random vectors.

The proposed estimators Σ̂‡0,1,n, Σ̂‡0,2,n, and Σ̂†0,2,n are evaluated. We compare them with the standard Bartlett kernel estimator Σ̂Bart,n and the QS kernel estimator Σ̂QS,n (see (4)). The bandwidths of Σ̂Bart,n and Σ̂QS,n are selected by Andrews's (1991) vector AR(1) plug-in rule. As far as we know, in the multivariate setting, no other estimator has been proved to be consistent and optimal in the presence of a nonconstant mean, autocorrelation, and heteroscedasticity. The results are shown in Figure 5. We also repeat the experiment with homoscedastic errors, that is, τij = 1 for all i, j. Since the results are very similar to the heteroscedastic case, we present them only in the supplementary materials.

Fig. 5 The values of log{MSE_W(·)} for Σ̂Bart,n, Σ̂QS,n, Σ̂‡0,1,n, Σ̂‡0,2,n, and Σ̂†0,2,n in the heteroscedastic case are plotted against n, where vec(W) = (1, 1/2, 1/2, 1)⊺ is used and MSE_W(·) is defined in (16). The left and right plots show the results for the constant-mean and nonconstant-mean cases, respectively. Note that the horizontal axes are plotted on the logarithmic scale.


From Figure 5, all five estimators are consistent in the constant-mean case. However, Σ̂Bart,n and Σ̂QS,n are no longer consistent when the mean is not constant. On the other hand, the mean structure does not affect the performance of Σ̂‡0,1,n, Σ̂‡0,2,n, and Σ̂†0,2,n, which verifies the claimed consistency and robustness. In addition, although the pilot estimator Σ̂†0,2,n does not perform as well as the optimal estimator Σ̂‡0,2,n, it still gives sufficiently good results, which supports using Σ̂†0,2,n as an initial estimator in practice.

5.4 Change-Point Detection

In this subsection, we consider the CP detection problem, that is, testing H0: EY1 = ⋯ = EYn against H1: ∃D1 such that EY1 = ⋯ = EY_{D1−1} ≠ EY_{D1} = ⋯ = EYn. We analyze (i) whether CP tests are monotonically powerful with respect to the jump magnitude |EY_{D1} − EY_{D1−1}|; and (ii) their power losses under a misspecified alternative hypothesis.

Let Tn(k) := n^{−1/2} Σ_{i=1}^{k} X̂i be the CUSUM process of X̂i = Yi − Y̅n. The standard KS test statistic is defined by Tn := max_{k∈{1,…,n}} |Tn(k)|/σ̂, where σ̂ is a consistent estimator of σ. Then, H0 is rejected at the 5% level if Tn > 1.358. Alternatively, the self-normalized KS test of Shao and Zhang (2010) can be used. Following them, we compare

  • (SZ) their self-normalized KS test, and

  • (KS) the standard KS tests with different estimators of σ², namely, σ̂²A,n, σ̂²CV,n, σ̂²JX,n, σ̂²WZ3,n, σ̂²AC,n, and Σ̂‡0,2,n, where σ̂²A,n is the Bartlett kernel estimator with Andrews's AR(1) plug-in selector of l; the estimator σ̂²JX,n was proposed by Juhl and Xiao (2009); and all other estimators are defined in Section 5.1.
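A minimal sketch of the standard KS statistic described above; any consistent estimate σ̂ can be plugged in:

```python
import math

def ks_statistic(y, sigma_hat):
    # T_n = max_k |n^{-1/2} sum_{i<=k} (Y_i - Y_bar)| / sigma_hat;
    # reject H0 at the 5% level when T_n > 1.358.
    n = len(y)
    ybar = sum(y) / n
    cumsum, peak = 0.0, 0.0
    for yi in y:
        cumsum += yi - ybar          # CUSUM of the demeaned series
        peak = max(peak, abs(cumsum))
    return peak / (math.sqrt(n) * sigma_hat)

print(ks_statistic([1.0, 2.0, 3.0, 4.0], sigma_hat=1.0))  # 1.0
```

The quality of σ̂ is exactly what separates the KS variants compared in this section.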

Detailed formulas of the above CP tests and estimators of σ² are presented in Section C.3 of the supplementary materials for reference. Consider the bilinear model Yi = Xi + μi, where Xi = (a + bεi)X_{i−1} + εi and εi ~iid N(0,1), for i = 1,…,n. The physical dependence measure decays at the rate δ_{4,n}[1] = O(ϱ^n), where ϱ = (a² + b²)^{1/2} (see Wu 2005, 2011). A larger ϱ implies stronger serial dependence. We use a = 0.33, 0.36, 0.39 and b = 0.5, 0.6, 0.7 so that ϱ ≈ 0.6, 0.7, 0.8, respectively. Denote them by Models B1–B3, respectively.
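Reading the displayed decay parameter as ϱ = (a² + b²)^{1/2}, an assumption consistent with the stated values, the three (a, b) pairs can be checked numerically:

```python
import math

# Models B1-B3: rho = sqrt(a^2 + b^2) should reproduce 0.6, 0.7, 0.8 (to 2 decimals).
pairs = {(0.33, 0.5): 0.6, (0.36, 0.6): 0.7, (0.39, 0.7): 0.8}
for (a, b), target in pairs.items():
    print(a, b, round(math.hypot(a, b), 2), target)
```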

Both the SZ and KS tests assume that, when H0 is false, the mean function is piecewise constant with one CP. If this is actually the case, we say that the alternative hypothesis is correctly specified; otherwise, the alternative hypothesis is said to be misspecified. We consider the following two alternative hypotheses in the experiments:

  • (correctly specified alternative) H1: μi=ξ× 1{i/n>1/4} for i=1,,n; and

  • (misspecified alternative) H1′: μi = ξ(1 + e^{−10i/n+5}) × 1{i/n > 1/4} for all i = 1,…,n,

where the value of ξ ∈ R controls the jump magnitude. If ξ = 0, both H1 and H1′ reduce to H0. Under H1, the mean jumps to ξ at i = ⌊n/4⌋ + 1 and then stays constant, whereas, under H1′, the mean shoots up at i = ⌊n/4⌋ + 1 and then decays towards ξ. In practice, a CP may arrive like H1′ instead of H1; hence, a good CP test should be powerful in both cases. A good size-α CP test should satisfy the following four properties, where α ∈ (0,1).

  • (Size correctness) The probability of rejecting H0 is close to α when H0 is correct.

  • (Powerfulness) The probability of rejecting H0 is high when H0 is incorrect.

  • (Monotonicity of power) The power is increasing with the magnitude of jump |ξ|.

  • (Robustness) The test is still powerful under misspecified alternative hypotheses.
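The two alternative mean shapes can be sketched as follows; the decaying form used for H1′ below is our reading of the display, chosen (up to the exact constants, which are hypothetical) so that the mean overshoots just after the CP and settles back towards ξ:

```python
import math

def mu_correct(i, n, xi):
    # H1: a clean level shift to xi after i/n = 1/4.
    return xi if i / n > 0.25 else 0.0

def mu_misspec(i, n, xi):
    # H1': shoots up just after i/n = 1/4, then decays towards xi as i/n -> 1.
    return xi * (1 + math.exp(-10 * i / n + 5)) if i / n > 0.25 else 0.0

n, xi = 1000, 1.0
print(mu_misspec(260, n, xi))   # large overshoot just after the change point
print(mu_misspec(1000, n, xi))  # close to xi at the end of the sample
```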

The simulation is conducted for n = 100, 400, 800 with nominal size α = 5%. Since the results are similar under different models, we only report the results under Model B2 here (see Figure 6). The full results are deferred to Section C.3 of the supplementary materials; the size-adjusted power curves are also presented there for reference. Under H1, all tests except KS(A) and KS(JX) have monotone power with respect to |ξ|. The test KS(WZ3) commits a Type I error more frequently than the nominal level even when the sample size is large. This over-size phenomenon is due to the use of an inefficient estimator of σ². The power curves are largely the same for KS(CV), KS(AC), and KS(MAC(2)), as they are essentially the same test. Observe that SZ is significantly less powerful when 0 < ξ < 1.

Fig. 6 The powers of the CP tests defined in Section 5.4 are plotted against the jump magnitude ξ under Model B2. The scenarios under the well-specified alternative H1 and the misspecified alternative H1′ are shown in the upper and lower plots, respectively. Dashed horizontal lines indicate the significance level α = 5% and zero. Note that the horizontal axis is plotted on the logarithmic scale for better visualization.


Under H1′, all tests except KS(MAC(2)) and KS(WZ3) immediately lose all power when |ξ| > 0. In particular, SZ remains powerless even when n and ξ are large. This is undesirable because SZ is very sensitive to whether the alternative hypothesis is well specified. KS(A), KS(CV), KS(JX), and KS(AC) are not powerful because they use inconsistent or inefficient estimators of σ². This gives a warning sign against using these tests in practice. It is worth emphasizing that KS(WZ3) seems more powerful than KS(MAC(2)); however, this is only because KS(WZ3) rejects too frequently regardless of whether H0 is true. Hence, the apparently more powerful KS(WZ3) test is not reliable. Among all the tests above, our proposed test KS(MAC(2)) is the only monotonically powerful test that has accurate size and is insensitive to misspecification of the alternative hypothesis.

6 Empirical Studies

6.1 Change Point Detection in S&P 500 Index

The Standard & Poor’s 500 (S&P 500) Index is a stock market index based on 500 representative companies in the USA. The daily adjusted close prices of the index, from 3 January 2006 to 30 December 2011 (n = 1511), are investigated. The dataset can be downloaded from http://finance.yahoo.com/quote/%5EGSPC/history. The financial crisis of 2008 is believed to have had a tremendous impact on the global stock market. We suspect that it led to an abrupt change in the stock market. Testing this claim is important since a discontinuous impact implies that the economy may have undergone a structural change.

Denote the logarithm of the S&P 500 Index by Yi. Observe that there is an obvious trend in Yi (see Figure 7). A standard approach is to study the return series yi := Yi − Y_{i−1} to get rid of the trend component. This differencing step is essential for many standard CP tests, for example, the SZ and KS tests presented in Section 5.4, because they cannot handle trends. Using the CUSUM-type CP estimator D̂1 (see (1) of the supplementary materials for its formula), we estimate the CP to be 10 March 2009. It is remarked that the same CP is detected by the method described in Altissimo and Corradi (2003). Hence, the CP test fails to capture the 2008 financial crisis. Indeed, testing H0: Ey1 = ⋯ = Eyn by the KS(MAC) test defined in Section 5.4, we fail to reject H0 at the 5% level. We conclude that the 2008 financial crisis had no jump impact on the returns yi. Since taking the difference of Yi may cancel out the potential jump effect, it seems desirable to analyze Yi directly (see Vogelsang 1999 for a similar analysis). Using the CP test proposed by Wu and Zhao (2007), we can test H0: the mean function i ↦ EYi is continuous, against H1: the mean function i ↦ EYi has a jump discontinuity. The test statistic is Qn := (kn σ̂)^{−1} max_{kn ≤ i ≤ n−kn} |Σ_{j=i+1}^{i+kn} Yj − Σ_{j=i−kn+1}^{i} Yj|, where σ̂² is a consistent estimator of σ², that is, the AVC of {Yi}, and kn = ⌊n^{0.6}⌋. Then H0 is rejected if Qn is large. Using MAC(2) to estimate σ, we obtain σ̂ = 0.0517 and find that H0 is rejected at any reasonable level. It is remarked that σ is estimated to be 0.0434 by the estimator WZ3. Although this estimate is a bit smaller than our proposed estimate, the same conclusion for testing H0 is obtained if it is used in the test statistic Qn. Although Wu and Zhao (2007) did not provide an estimator of the CP, they argue that if i + 1 is a discontinuity point, then the difference of the averages inside the statistic Qn should be large. Following their idea, D̂WZ := 1 + argmax_{kn ≤ i ≤ n−kn} |Σ_{j=i+1}^{i+kn} Yj − Σ_{j=i−kn+1}^{i} Yj| is a reasonable estimator of the CP. The estimated CP, D̂WZ, is 7 October 2008 (see Figure 7). It indicates the 2008 financial crisis quite accurately and coincides with our understanding of the stock market.
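The window-comparison idea behind D̂WZ can be sketched as follows (0-based indexing; the window length k_n and the toy series are illustrative, following the reconstructed display):

```python
def jump_location(y, k):
    # Slide two adjacent windows of length k and locate the largest discrepancy
    # between their sums; return 1 + argmax, as in the definition of D_WZ.
    n = len(y)
    best_i, best_gap = None, -1.0
    for i in range(k, n - k):          # i plays the role of k_n <= i <= n - k_n
        forward = sum(y[i:i + k])      # sum_{j=i+1}^{i+k_n} Y_j in 1-based notation
        backward = sum(y[i - k:i])     # sum_{j=i-k_n+1}^{i} Y_j
        gap = abs(forward - backward)
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i + 1

y = [0.0] * 100 + [1.0] * 100          # a clean level shift at position 101 (1-based)
print(jump_location(y, k=10))  # 101
```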

Fig. 7 Time series plot of {Yi}, that is, the daily S&P 500 Index (3 January 2006–30 December 2011) on the log scale (see Section 6.1). The vertical dotted line indicates the value of D̂WZ estimated by the statistic in Wu and Zhao (2007). Here σ² is estimated by MAC(2).


6.2 Simultaneous Change Point Detection in Several Indices

Besides the S&P 500 Index mentioned in Section 6.1, several other stock market indices are commonly used by traders, for example, the Dow Jones Index, the Nasdaq Composite, and the Russell 2000. In this subsection, we investigate whether these four market indices can be used simultaneously to detect the 2008 financial crisis more precisely.

Consider the squared daily returns, which can be used as proxies for daily volatilities, of the aforementioned four indices over the period 1 July 2008–30 December 2008 (see Figure 8). Applying the CUSUM-type CP estimator D̂1 (see (1) of the supplementary materials for its formula) to each index individually, we obtain the same CP, 29 September 2008. It is remarked that the no-CP null hypothesis is rejected at the 5% level by the test KS(MAC(2)) for each individual index.

Fig. 8 The squared returns of four stock indices (1 July 2008–30 December 2008) (see Section 6.2). The two vertical lines denote the CP locations. The earlier and later CPs are detected by the multivariate and univariate CUSUM CP estimators, respectively.


Since these stock market indices are highly correlated and are believed to follow the market trend very closely, a CP (if any) is likely to appear in all of them simultaneously. Hence, using multivariate time series to detect a CP can be more accurate and precise. Applying the multivariate version of the KS CP test (Horváth, Kokoszka, and Steinebach 1999) to the four indices, we detect the CP to be 15 September 2008. From Figure 8, the squared returns between 15 and 29 September are slightly higher than in the first portion of the series. Hence, using multivariate time series helps detect such small changes. Consequently, multivariate tests are potentially more useful in practice. It is also remarked that the null hypothesis of no simultaneous CP is rejected at the 5% level by the CP test of Horváth, Kokoszka, and Steinebach (1999) equipped with our proposed MAC(2) estimator.

7 Conclusions

In this article, we propose an estimator of the ACM in nonstationary time series. The estimator has several desirable features: (i) it is robust against unknown trends and a divergent number of jumps; (ii) it is optimal in the sense that an asymptotically correct optimal bandwidth can be implemented robustly; (iii) it is statistically efficient since it attains the optimal L2 convergence rate under different strengths of serial dependence; (iv) it is computationally fast because neither numerical optimization, trend estimation, nor CP detection is required; and (v) it is handy because its formula can be as simple as (24).

Some applications of the estimator are illustrated. In particular, we find that the CP test equipped with the proposed estimator is the only available test that is monotonically powerful and insensitive to a misspecified alternative hypothesis.

Supplementary Materials

Supplementary materials include graphical illustration, additional simulation results, and proofs. The R-package MAC for computing the proposed estimator is also provided.


Acknowledgments

The author thanks the editor Christian Hansen, the associate editor, and two reviewers for their detailed and insightful comments. The author also gratefully thanks Neil Shephard for his helpful advice on improving the estimator as well as Xiao-Li Meng, Jim Stock and Pierre Jacob for fruitful discussions.

Additional information

Funding

This research was supported by the Direct Grant (4053356) provided by the Chinese University of Hong Kong, and the Early Career Scheme (24306919) provided by the University Grant Committee of HKSAR.

References