
A posterior convergence rate theorem for general Markov chains

Pages 5910-5921 | Received 05 Nov 2020, Accepted 21 Dec 2021, Published online: 08 Feb 2022

Abstract

This paper establishes a posterior convergence rate theorem for general Markov chains. Our approach is based on the Hausdorff α-entropy introduced by Xing (Electronic Journal of Statistics 2:848–62, 2008) and Xing and Ranneby (Journal of Statistical Planning and Inference 139 (7):2479–89, 2009). As an application we illustrate our results on a nonlinear autoregressive model.

1. Introduction

The aim of this paper is to study the asymptotic behavior of posterior distributions based on observations which arise from Markov chains. Let $X_0,X_1,\ldots$ be a Markov chain with transition density $p_\theta(y\mid x)$ and initial density $q_\theta(x_0)$ with respect to some $\sigma$-finite measure $\mu$ on a measurable space $(\mathbb X,\mathcal A)$. We assume that the function $x\mapsto q_\theta(x)$ and the two-variable function $(x,y)\mapsto p_\theta(y\mid x)$ are measurable for all parameters $\theta$ in the parameter set $\Theta$. So the joint distribution $P_\theta^{(n)}$ of $X^{(n)}=\{X_0,X_1,\ldots,X_n\}$ has a density given by $$p_\theta^{(n)}(x^{(n)})=q_\theta(x_0)\prod_{i=1}^n p_\theta(x_i\mid x_{i-1})$$ relative to the product measure $\mu(dx_0)\,\mu(dx_1)\cdots\mu(dx_n)$, where the parameter $\theta$ does not depend on the size of $X^{(n)}$. Denote by $\theta_0$ the true parameter generating the observations $X_0,X_1,\ldots$. Note that any semimetric $d\big((q_{\theta_1},p_{\theta_1}),(q_{\theta_2},p_{\theta_2})\big)$ on the product space of the initial densities and the transition densities naturally induces a semimetric $d(\theta_1,\theta_2)=d\big((q_{\theta_1},p_{\theta_1}),(q_{\theta_2},p_{\theta_2})\big)$ on $\Theta$ when the mapping $\theta\mapsto(q_\theta,p_\theta)$ is one-to-one, which is assumed throughout the paper. Given a prior $\Pi$ on $\Theta$, the posterior distribution $\Pi(\cdot\mid X_0,X_1,\ldots,X_n)$ is a random probability measure given by $$\Pi(B\mid X_0,X_1,\ldots,X_n)=\frac{\int_B p_\theta^{(n)}(X^{(n)})\,\Pi(d\theta)}{\int_\Theta p_\theta^{(n)}(X^{(n)})\,\Pi(d\theta)}=\frac{\int_B R_\theta^{(n)}(X^{(n)})\,\Pi(d\theta)}{\int_\Theta R_\theta^{(n)}(X^{(n)})\,\Pi(d\theta)}$$ for each measurable subset $B$ of $\Theta$, where $R_\theta^{(n)}(X^{(n)})=p_\theta^{(n)}(X^{(n)})/p_{\theta_0}^{(n)}(X^{(n)})$ stands for the likelihood ratio.
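To make the posterior formula concrete, here is a small numerical sketch (not from the paper): a toy Markov chain with transition density $p_\theta(y\mid x)=\phi(y-\theta\tanh x)$, a uniform grid prior on $\theta$, and the posterior computed by normalizing the integrated likelihood over the grid. The model, grid, and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_phi(z):
    # log density of the standard normal
    return -0.5 * z**2 - 0.5 * np.log(2 * np.pi)

# Simulate a chain with true parameter theta0:
#   X_i = theta0 * tanh(X_{i-1}) + xi_i  (a Markov chain with bounded drift)
theta0, n = 0.8, 400
x = np.zeros(n + 1)
for i in range(1, n + 1):
    x[i] = theta0 * np.tanh(x[i - 1]) + rng.standard_normal()

# Uniform grid prior on Theta = [-2, 2]
thetas = np.linspace(-2, 2, 401)
# log-likelihood log p_theta^{(n)}(X^{(n)}) up to the theta-free initial density
loglik = np.array([log_phi(x[1:] - t * np.tanh(x[:-1])).sum() for t in thetas])
post = np.exp(loglik - loglik.max())
post /= post.sum()                    # discrete posterior Pi(theta | X_0,...,X_n)

# Posterior mass near the true parameter
mass_near = post[np.abs(thetas - theta0) < 0.3].sum()
```

With a few hundred observations, essentially all posterior mass sits in a small neighborhood of $\theta_0$, which is the kind of concentration the rate theorems below quantify.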

Recall that the posterior distribution $\Pi(\cdot\mid X_0,X_1,\ldots,X_n)$ is said to converge almost surely at a rate at least $\varepsilon_n$ if there exists $r>0$ such that $$\Pi\big(\theta\in\Theta:\ d(\theta,\theta_0)\ge r\,\varepsilon_n\mid X_0,X_1,\ldots,X_n\big)\to 0\quad\text{almost surely as } n\to\infty.$$ Posterior consistency is an important issue in Bayesian analysis. Much work has been concerned with the asymptotic behavior of posterior distributions for independent and identically distributed observations; see, for instance, Barron, Schervish, and Wasserman (Citation1999), Shen and Wasserman (Citation2001), Ghosal and van der Vaart (Citation2007), Walker, Lijoi, and Prunster (Citation2007), Walker (Citation2003), Walker (Citation2004), Xing and Ranneby (Citation2009), Xing (Citation2011a) and Xing (Citation2011b). A well-known older approach is based on the existence of uniformly consistent tests. In this paper we use an integration condition together with the Hausdorff α-entropy to study convergence rates of posteriors when the observations are not independent and identically distributed. The integration condition and the Hausdorff α-entropy have an advantage in applications, because both are prior-dependent. The Hausdorff α-entropy condition was introduced in Xing (Citation2008) and Xing and Ranneby (Citation2009), and it is weaker than the metric entropy condition. By means of the integration condition and the Hausdorff α-entropy, we establish a posterior convergence rate theorem for general Markov chains. As an application we discuss the posterior rate of convergence for the nonlinear autoregressive model.

The layout of this paper is as follows. In Sec. 2 we present a prior-dependent integration inequality and prove a general posterior convergence rate theorem for Markov chains. In Sec. 3 we illustrate our result by finding a posterior convergence rate for the nonlinear autoregressive model. The technical proofs are collected in the Appendix.

2. A convergence rate theorem for Markov chains

In this section we introduce a prior-dependent integration condition for consistency of posterior distributions. Together with the Hausdorff α-entropy, the integration condition plays a central role in the study of Bayesian convergence rates.

Recall that the Hausdorff α-entropy $J(\delta,\Theta_1,\alpha,d)$ for a subset $\Theta_1\subset\Theta$ is the logarithm of the minimal sum of $\alpha$th powers of prior masses of balls of $d$-radius $\delta$ needed to cover $\Theta_1$; see Xing (Citation2008) and Xing and Ranneby (Citation2009) for the details of the Hausdorff α-entropy. For simplicity of notation, we define the Hausdorff α-constant $C(\delta,\Theta_1,\alpha,d):=e^{J(\delta,\Theta_1,\alpha,d)}$ of any subset $\Theta_1$ of $\Theta$. Observe that $C(\delta,\Theta_1,\alpha,d)$ depends on the prior $\Pi$. It was proved in Xing and Ranneby (Citation2009) that the inequality $$\Pi(\Theta_1)^\alpha\le C(\delta,\Theta_1,\alpha,d)\le\Pi(\Theta_1)^\alpha\,N(\delta,\Theta_1,d)^{1-\alpha}$$ holds for any $0\le\alpha\le1$, where $N(\delta,\Theta_1,d)$ denotes the minimal number of balls of $d$-radius $\delta$ needed to cover $\Theta_1\subset\Theta$. We shall adopt the following Hellinger-type semimetrics:
$$H\big(p_{\theta_1}(y\mid x),p_{\theta_2}(y\mid x)\big)=\Big(\int_{\mathbb X}\int_{\mathbb X}\big(\sqrt{p_{\theta_1}(y\mid x)}-\sqrt{p_{\theta_2}(y\mid x)}\,\big)^2\,d\mu(y)\,d\nu(x)\Big)^{1/2},$$
$$H\big(q_{\theta_1}(x),q_{\theta_2}(x)\big)=\Big(\int_{\mathbb X}\big(\sqrt{q_{\theta_1}(x)}-\sqrt{q_{\theta_2}(x)}\,\big)^2\,d\mu(x)\Big)^{1/2},$$
$$H_*\big(p_{\theta_1}(y\mid x),p_{\theta_2}(y\mid x)\big)=\Big(\int_{\mathbb X}\int_{\mathbb X}\big(\sqrt{p_{\theta_1}(y\mid x)}-\sqrt{p_{\theta_2}(y\mid x)}\,\big)^2\Big(\frac23\sqrt{\frac{p_{\theta_1}(y\mid x)}{p_{\theta_2}(y\mid x)}}+\frac13\Big)\,d\mu(y)\,d\nu(x)\Big)^{1/2},$$
$$H_*\big(q_{\theta_1}(x),q_{\theta_2}(x)\big)=\Big(\int_{\mathbb X}\big(\sqrt{q_{\theta_1}(x)}-\sqrt{q_{\theta_2}(x)}\,\big)^2\Big(\frac23\sqrt{\frac{q_{\theta_1}(x)}{q_{\theta_2}(x)}}+\frac13\Big)\,d\mu(x)\Big)^{1/2}.$$
Denote $$W_{n1}(\theta_0,\varepsilon)=\Big\{\theta\in\Theta:\ H_*(p_{\theta_0},p_\theta)^2+\frac1n\,H_*(q_{\theta_0},q_\theta)^2\le\varepsilon^2\Big\}.$$ By means of the metric $d(\theta,\theta_0):=H(p_\theta,p_{\theta_0})$, Ghosal and van der Vaart (Citation2007, Theorem 5) gave an in-probability posterior convergence rate theorem for stationary α-mixing Markov chains. Since calculation of the α-mixing coefficients is generally not easy and many processes are neither mixing nor stationary, it seems worthwhile to develop a posterior convergence rate theorem for Markov chains which may be neither stationary nor α-mixing. We now present an almost sure assertion in this direction. Our result is based on the following prior-dependent integration condition.
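As a sanity check on the definition of $H_*$, the algebraic identity $\int f^{3/2}g^{-1/2}\,d\mu=1+\frac32 H_*(f,g)^2$ for probability densities $f,g$ (the identity used later in the proof of Lemma 1) can be verified numerically. A sketch with two Gaussian densities; the particular densities and grid are illustrative choices:

```python
import numpy as np

# Check: for probability densities f, g (each integrating to 1),
#   int f^{3/2} g^{-1/2} d(mu)  =  1 + (3/2) * H_*(f, g)^2,
# where H_*(f,g)^2 = int (sqrt f - sqrt g)^2 * ((2/3)sqrt(f/g) + 1/3) d(mu).
y = np.linspace(-20.0, 20.0, 400001)
dy = y[1] - y[0]
f = np.exp(-0.5 * y**2) / np.sqrt(2 * np.pi)                            # N(0, 1)
g = np.exp(-0.5 * ((y - 0.5) / 1.2)**2) / (1.2 * np.sqrt(2 * np.pi))    # N(0.5, 1.2^2)

lhs = np.sum(f**1.5 * g**-0.5) * dy
hstar_sq = np.sum((np.sqrt(f) - np.sqrt(g))**2 * ((2 / 3) * np.sqrt(f / g) + 1 / 3)) * dy
rhs = 1 + 1.5 * hstar_sq
```

The two sides agree to high accuracy, which also confirms that $H_*$ dominates the usual Hellinger distance only up to the density-ratio weight.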

Throughout this paper the notation $a\lesssim b$ means $a\le Cb$ for some positive constant $C$ which is universal or fixed in the proof. Write $a\asymp b$ if $a\lesssim b$ and $b\lesssim a$. Denote $Pf^\alpha=\int_{\mathbb X}f^\alpha\,dP$, the integral of the nonnegative function $f$ with power $\alpha$ relative to the measure $P$ on $\mathbb X$.

Proposition 1.

Suppose that there exist a $\mu$-integrable function $r(y)$ and constants $a_1\ge a_0>0$ with $a_1\ge1$ such that $d\nu(y)=r(y)\,d\mu(y)$ and $a_0\,r(y)\le p_\theta(y\mid x)\le a_1\,r(y)$ for all $\theta\in\Theta$ and $x,y\in\mathbb X$. Let $0<\delta<\frac{\sqrt{a_0}}{2\sqrt{a_1}}$ and $0<\alpha<\frac12$. Then the inequality
$$P_{\theta_0}^{(n)}\left(\int_{\{\theta\in\Theta_1:\,d(\theta,\theta_0)>\varepsilon\}}\frac{q_\theta(X_0)}{q_{\theta_0}(X_0)}\prod_{i=1}^n\frac{p_\theta(X_i\mid X_{i-1})}{p_{\theta_0}(X_i\mid X_{i-1})}\,\Pi(d\theta)\right)^{\alpha}\le 2\,e^{-\frac{1-2\alpha}{2}\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2 n\varepsilon^2}\,C\big(\delta\varepsilon,\{\theta\in\Theta_1:\ d(\theta,\theta_0)>\varepsilon\},\alpha,d\big)$$
holds for all $n$, $\varepsilon>0$ and $\Theta_1\subset\Theta$, where $d(\theta,\theta_0)=H(p_\theta,p_{\theta_0})$.

From Proposition 1 we obtain

Theorem 1.

Suppose that all assumptions of Proposition 1 hold, and suppose that $\varepsilon_n>0$ and $n\varepsilon_n^2\ge c_0\log n$ for all large $n$ and some fixed constant $c_0>0$. Suppose that there exist constants $c_1<\frac{1-2\alpha}{2}\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2$ and $c_2>\frac1{c_0}$ and a sequence of subsets $\Theta_n$ of $\Theta$ such that
$$C\big(\delta j\varepsilon_n,\{\theta\in\Theta_n:\ j\varepsilon_n<d(\theta,\theta_0)\le 2j\varepsilon_n\},\alpha,d\big)\le e^{c_1j^2n\varepsilon_n^2}\,\Pi\big(W_{n1}(\theta_0,\varepsilon_n)\big)^\alpha$$
for all large $j$, $n$, and
$$\sum_{n=1}^\infty\frac{e^{n\varepsilon_n^2(3a_1+4c_2)}\,\Pi(\Theta\setminus\Theta_n)}{\Pi\big(W_{n1}(\theta_0,\varepsilon_n)\big)}<\infty.$$

Then there exists $b>0$ such that for each large $r$ and all large $n$, $$\Pi\big(\theta\in\Theta:\ d(\theta,\theta_0)\ge r\,\varepsilon_n\mid X_0,X_1,\ldots,X_n\big)\le e^{-bn\varepsilon_n^2}\quad\text{almost surely.}$$

By choosing $\delta=\frac{\sqrt{a_0}}{4\sqrt{a_1}}$ and $\alpha=\frac14$ we can easily get

Corollary 1.

Suppose that there exist a $\mu$-integrable function $r(y)$ and constants $a_1\ge a_0>0$ with $a_1\ge1$ such that $d\nu(y)=r(y)\,d\mu(y)$ and $a_0\,r(y)\le p_\theta(y\mid x)\le a_1\,r(y)$ for all $\theta\in\Theta$ and $x,y\in\mathbb X$. Suppose that $\varepsilon_n>0$ and $n\varepsilon_n^2\ge c_0\log n$ for all large $n$ and some fixed constant $c_0>0$. Suppose that there exist constants $c_1,c_2,c_3$ with $3c_1+c_2<\frac{a_0}{16}$ and $c_3>\frac1{c_0}$, and a sequence of subsets $\Theta_n$ of $\Theta$ such that for all large $j$ and $n$,

  (i) $N\Big(\frac{\sqrt{a_0}}{4\sqrt{a_1}}\,j\varepsilon_n,\ \{\theta\in\Theta_n:\ j\varepsilon_n<d(\theta,\theta_0)\le 2j\varepsilon_n\},\ d\Big)\le e^{c_1j^2n\varepsilon_n^2}$;

  (ii) $\Pi\big(\theta\in\Theta_n:\ j\varepsilon_n<d(\theta,\theta_0)\le 2j\varepsilon_n\big)\le e^{c_2j^2n\varepsilon_n^2}\,\Pi\big(W_{n1}(\theta_0,\varepsilon_n)\big)$;

  (iii) $\sum_{n=1}^\infty\frac{e^{n\varepsilon_n^2(3a_1+4c_3)}\,\Pi(\Theta\setminus\Theta_n)}{\Pi\big(W_{n1}(\theta_0,\varepsilon_n)\big)}<\infty$.

Then there exists $b>0$ such that for each large $r$ and all large $n$, $$\Pi\big(\theta\in\Theta:\ d(\theta,\theta_0)\ge r\,\varepsilon_n\mid X_0,X_1,\ldots,X_n\big)\le e^{-bn\varepsilon_n^2}\quad\text{almost surely.}$$

3. Nonlinear autoregression

In this section we discuss an application of our theorems. By means of Corollary 1, we improve the posterior rate of convergence for the nonlinear autoregressive model in Ghosal and van der Vaart (Citation2007).

We observe $X_1,X_2,\ldots,X_n$ of a time series $\{X_t:\ t\in\mathbb Z\}$ given by $$X_i=f(X_{i-1})+\xi_i\quad\text{for } i=1,2,\ldots,n,$$ where $\xi_1,\xi_2,\ldots,\xi_n$ are i.i.d. random variables with the standard normal distribution and the unknown regression function $f$ lies in the space $\mathcal F$ which consists of all functions $f$ with $\sup_{x\in\mathbb R}|f(x)|\le M$ for some fixed positive constant $M$. Let $q_f(x)$ be the density of $X_0$ relative to the Lebesgue measure $d\mu$ on $\mathbb R$. So $X_0,X_1,\ldots$ can be considered as a Markov chain generated by the transition density $p_f(y\mid x)=\phi(y-f(x))$ with $\phi(x)=(2\pi)^{-1/2}e^{-x^2/2}$ and the initial density $q_f(x)$. Since $\phi(x)$ is a strictly positive continuous function tending to zero as $x\to\pm\infty$, there exist two constants $0<a_0<1<a_1$ depending only on $M$ such that $a_0\,\phi(y)\le p_f(y\mid x)\le a_1\,\phi(y)$ for all $f\in\mathcal F$ and $-\infty<y,x<\infty$. Assume that there exists a constant $N>0$ such that the set of initial densities of the Markov chain satisfies $H_*(q_{f_1},q_{f_2})\le N$ for all initial densities $q_{f_1}$ and $q_{f_2}$. For instance, all of the initial densities with $a_0\,\phi(x)\le q_f(x)\le a_1\,\phi(x)$ satisfy $H_*(q_{f_1},q_{f_2})\le 2(a_1/a_0)^{1/4}$ and hence form a set with the required property. Define a measure $d\nu=\phi\,d\mu$ on $\mathbb R$ and a norm $\|f\|_2=\big(\int_{\mathbb R}|f|^2\,d\nu\big)^{1/2}$ on $\mathcal F$. Assume that the true regression function $f_0\in\mathcal F$ belongs to the Lipschitz continuous space $\mathrm{Lip}_L$, which consists of all functions $f$ on $(-\infty,\infty)$ satisfying $|f(x)-f(y)|\le L\,|x-y|$ for all $-\infty<x,y<\infty$, where $L$ is a fixed positive constant. When the Markov chain is stationary, Ghosal and van der Vaart (Citation2007, Section 7.4) constructed a prior on the regression functions and obtained the in-probability posterior convergence rate $n^{-1/3}(\log n)^{1/2}$, which is the minimax rate times the logarithmic factor $(\log n)^{1/2}$. In the following we shall apply Corollary 1 to get the posterior convergence rate $n^{-1/3}(\log n)^{1/6}$ in the almost sure sense for a general Markov chain defined as above.
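A minimal simulation sketch of this model (not from the paper): the chain is generated with an assumed bounded Lipschitz regression function $f_0=\cos$ (so $M=L=1$), and a piecewise-constant least-squares fit on a partition of $[-A,A)$ stands in for the step-function sieve used below. Since the zero function is itself a step function, the fit is guaranteed to match the data at least as well as predicting zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Nonlinear AR(1): X_i = f0(X_{i-1}) + xi_i with |f0| <= M and f0 Lipschitz.
# f0(x) = cos(x) is an illustrative assumption (M = L = 1).
M, n = 1.0, 2000
f0 = np.cos
x = np.zeros(n + 1)
for i in range(1, n + 1):
    x[i] = f0(x[i - 1]) + rng.standard_normal()

# Step-function sieve: constant levels on K cells of [-A, A), zero outside.
A, K = 3.0, 24
edges = np.linspace(-A, A, K + 1)
prev, nxt = x[:-1], x[1:]
beta_hat = np.zeros(K)          # per-cell least-squares level (0 for empty cells)
for k in range(K):
    mask = (prev >= edges[k]) & (prev < edges[k + 1])
    if mask.any():
        beta_hat[k] = np.clip(nxt[mask].mean(), -M, M)

def f_beta(t):
    # piecewise-constant f_beta, vanishing outside [-A, A)
    inside = (t >= -A) & (t < A)
    idx = np.clip(((t + A) / (2 * A) * K).astype(int), 0, K - 1)
    return np.where(inside, beta_hat[idx], 0.0)

sse_fit = np.sum((nxt - f_beta(prev))**2)   # residual of the step-function fit
sse_zero = np.sum(nxt**2)                   # residual of predicting zero
```

The clipping to $[-M,M]$ keeps the fitted step function inside the parameter space $\mathcal F$, mirroring the uniform prior on $[-M,M]^{K_n}$ constructed below.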

First, we note that for any $f\in\mathcal F$,
$$H_*(p_{f_0},p_f)^2+\frac1n\,H_*(q_{f_0},q_f)^2\le\frac{a_1}{a_0}\,H(p_{f_0},p_f)^2+\frac{N^2}{n}=\frac{2a_1}{a_0}\int_{\mathbb R}\Big(1-e^{-\frac{(f(x)-f_0(x))^2}{8}}\Big)\,d\nu(x)+\frac{N^2}{n}\le\frac{a_1\,\|f-f_0\|_2^2}{4a_0}+\frac{N^2}{n},$$
where the last inequality follows from the elementary inequality $1-e^{-t}\le t$. Hence for some small constant $b_1>0$ we have that $W_{n1}(f_0,\varepsilon_n)\supset\{f\in\mathcal F:\ \|f-f_0\|_2\le b_1\,\varepsilon_n\}$ for all large $n$. Similarly, $\|f-f_0\|_2\lesssim H(p_f,p_{f_0})$ holds for all $f\in\mathcal F$ with $\|f-f_0\|_2\le1$. Hence Corollary 1 works well for the metric $\|\cdot\|_2$.
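The comparison between $H$ and $\|\cdot\|_2$ rests on the closed form of the squared Hellinger distance between unit-variance normals, $H^2=2\big(1-e^{-(\mu_1-\mu_2)^2/8}\big)$, together with $1-e^{-t}\le t$. A quick numerical check; the grid and the means are arbitrary illustrative choices:

```python
import numpy as np

def phi(z):
    # standard normal density
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

# Squared Hellinger distance between N(m1, 1) and N(m2, 1):
#   H^2 = int (sqrt(phi(y-m1)) - sqrt(phi(y-m2)))^2 dy = 2*(1 - exp(-(m1-m2)^2/8)),
# and 1 - e^{-t} <= t gives H^2 <= (m1 - m2)^2 / 4.
y = np.linspace(-30.0, 30.0, 600001)
dy = y[1] - y[0]

m1, m2 = 0.3, 1.1
h2_num = np.sum((np.sqrt(phi(y - m1)) - np.sqrt(phi(y - m2)))**2) * dy
h2_closed = 2 * (1 - np.exp(-(m1 - m2)**2 / 8))
```

Integrating the pointwise identity over $x$ with weight $d\nu$ is exactly how $H(p_{f_0},p_f)^2$ is expressed in terms of $f-f_0$ above.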

We also need some basic facts on approximation of Lipschitz continuous functions by step functions. Given a finite interval $[-A_n,A_n)$ and a positive integer $K_n$, we make the partition $[-A_n,A_n)=\bigcup_{k=1}^{K_n}I_k$ with $$I_k=\Big[-A_n+\frac{2A_n(k-1)}{K_n},\ -A_n+\frac{2A_n\,k}{K_n}\Big)$$ for $k=1,2,\ldots,K_n$. Write $I_0=\mathbb R\setminus[-A_n,A_n)$. The space of step functions relative to the partition is the set of functions $h:[-A_n,A_n)\to\mathbb R$ such that $h$ is identically equal to some constant on each $I_k$ for $k=1,2,\ldots,K_n$; more precisely, $h(x)=\sum_{k=1}^{K_n}\beta_k\,1_{I_k}(x)$ for some $\beta=(\beta_1,\beta_2,\ldots,\beta_{K_n})\in[-M,M]^{K_n}\subset\mathbb R^{K_n}$, where $1_{I_k}(x)$ denotes the indicator function of $I_k$. Denote by $f_\beta(x)$ the function on $(-\infty,\infty)$ which is equal to $\sum_{k=1}^{K_n}\beta_k\,1_{I_k}(x)$ on $[-A_n,A_n)$ and vanishes outside $[-A_n,A_n)$. Hence $f_\beta\in\mathcal F$ and $\|f_{\beta_1}-f_{\beta_2}\|_2=\|\beta_1-\beta_2\|_*$, where $\|\beta\|_*=\big(\sum_{k=1}^{K_n}\beta_k^2\int_{I_k}d\nu\big)^{1/2}$. Let $\Pi$ be the prior on $\mathcal F$ which is induced by the map $\beta\mapsto f_\beta$ such that all the coordinates $\beta_k$ of $\beta$ are chosen to be i.i.d. random variables with the uniform distribution on $[-M,M]$. Hence the support $\mathcal F_\Pi$ of $\Pi$ consists of all such functions $f_\beta$. Take $A_n=2\sqrt{\log(1/\varepsilon_n)}\lesssim\sqrt{\log n}$ and $K_n=\big\lfloor\frac{3LA_n}{b_1\varepsilon_n}\big\rfloor+1$ with $\varepsilon_n=\big(\frac{\sqrt{\log n}}{n}\big)^{1/3}$. Then $K_n\lesssim(n\log n)^{1/3}\asymp n\varepsilon_n^2$. Write $\beta_0=(\beta_{0,1},\beta_{0,2},\ldots,\beta_{0,K_n})$ with $\beta_{0,k}=f_0\big(-A_n+\frac{2A_n(k-1)}{K_n}\big)$. Since $f_0\in\mathcal F\cap\mathrm{Lip}_L$, we have that $f_{\beta_0}\in\mathcal F$ and $\sup_{-A_n\le x<A_n}|f_{\beta_0}(x)-f_0(x)|\le LA_n/K_n\le b_1\varepsilon_n/3$. From the triangle inequality and the inequality $\int_x^\infty\phi(t)\,dt\le\phi(x)/x$ for all $x>0$, it follows that for all $f_\beta\in\mathcal F_\Pi$ and all large $n$,
$$\big|\,\|f_\beta-f_0\|_2-\|f_\beta-f_{\beta_0}\|_2\,\big|\le\|f_{\beta_0}-f_0\|_2\le\Big(\int_{-A_n}^{A_n}|f_0-f_{\beta_0}|^2\,d\nu\Big)^{1/2}+\Big(\int_{I_0}f_0^2\,d\nu\Big)^{1/2}\le\frac{b_1\varepsilon_n}{3}\Big(\int_{-A_n}^{A_n}d\nu\Big)^{1/2}+M\Big(\frac{\phi(A_n)}{A_n}\Big)^{1/2}\le\frac{b_1\varepsilon_n}{3}+\frac{M\varepsilon_n}{(2\pi)^{1/4}A_n^{1/2}}\le\frac{b_1\varepsilon_n}{2}.$$
Thus for all large $j$ and $n$, we have
$$\frac{\Pi\big(f_\beta\in\mathcal F_\Pi:\ j\varepsilon_n<\|f_\beta-f_0\|_2\le 2j\varepsilon_n\big)}{\Pi\big(W_{n1}(\theta_0,\varepsilon_n)\big)}\le\frac{\Pi\big(f_\beta\in\mathcal F_\Pi:\ \|f_\beta-f_0\|_2\le 2j\varepsilon_n\big)}{\Pi\big(f_\beta\in\mathcal F_\Pi:\ \|f_\beta-f_0\|_2\le b_1\varepsilon_n\big)}\le\frac{\Pi\big(f_\beta\in\mathcal F_\Pi:\ \|f_\beta-f_{\beta_0}\|_2\le 3j\varepsilon_n\big)}{\Pi\big(f_\beta\in\mathcal F_\Pi:\ \|f_\beta-f_{\beta_0}\|_2\le\frac{b_1}{2}\varepsilon_n\big)}=\frac{\Pi\big(\beta\in[-M,M]^{K_n}:\ \|\beta-\beta_0\|_*\le 3j\varepsilon_n\big)}{\Pi\big(\beta\in[-M,M]^{K_n}:\ \|\beta-\beta_0\|_*\le\frac{b_1}{2}\varepsilon_n\big)}.$$
Note that the Euclidean volume of the $K_n$-dimensional ellipsoid $\{\beta\in\mathbb R^{K_n}:\ \|\beta-\beta_0\|_*\le r\}$ is equal to $r^{K_n}$ times the Euclidean volume of the "unit" $K_n$-dimensional ellipsoid $\{\beta\in\mathbb R^{K_n}:\ \|\beta-\beta_0\|_*\le1\}$. So the last quotient does not exceed $c^{K_n}j^{K_n}=e^{K_n\log(cj)}$ for some constant $c>0$, which is less than $e^{c_2j^2n\varepsilon_n^2}$ for any given $c_2>0$ and all large $j$.
Hence we have obtained condition (ii) of Corollary 1. Similarly, for all large $j$ and $n$, we have
$$N\Big(\frac{\sqrt{a_0}}{4\sqrt{a_1}}\,j\varepsilon_n,\ \{f_\beta\in\mathcal F_\Pi:\ j\varepsilon_n<\|f_\beta-f_0\|_2\le 2j\varepsilon_n\},\ \|\cdot\|_2\Big)\le N\Big(\frac{\sqrt{a_0}}{4\sqrt{a_1}}\,j\varepsilon_n,\ \{f_\beta\in\mathcal F_\Pi:\ \|f_\beta-f_{\beta_0}\|_2\le 3j\varepsilon_n\},\ \|\cdot\|_2\Big)=N\Big(\frac{\sqrt{a_0}}{4\sqrt{a_1}}\,j\varepsilon_n,\ \{\beta\in[-M,M]^{K_n}:\ \|\beta-\beta_0\|_*\le 3j\varepsilon_n\},\ \|\cdot\|_*\Big),$$
which, by Lemma 4.1 in Pollard (Citation1990), is less than $b_2^{K_n}=e^{K_n\log b_2}$ for some constant $b_2>0$, and therefore condition (i) of Corollary 1 holds for any given $c_1>0$. Finally, taking $\Theta_n$ to be the whole support $\mathcal F_\Pi$ gives $\Pi(\Theta\setminus\Theta_n)=0$, so condition (iii) holds trivially, and Corollary 1 yields the asserted almost sure rate $\varepsilon_n$.
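The isometry $\|f_{\beta_1}-f_{\beta_2}\|_2=\|\beta_1-\beta_2\|_*$ between step functions and their coefficient vectors can also be checked numerically; a sketch with an arbitrary partition and random coefficient vectors (all constants illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Check the isometry ||f_{b1} - f_{b2}||_2 = ||b1 - b2||_*, where
#   ||beta||_*^2 = sum_k beta_k^2 * nu(I_k)  and  d(nu) = phi d(mu).
A, K, M = 2.0, 8, 1.0
y = np.linspace(-A, A, 400001)[:-1]      # grid over [-A, A)
dy = y[1] - y[0]
phi = np.exp(-0.5 * y**2) / np.sqrt(2 * np.pi)

b1 = rng.uniform(-M, M, K)
b2 = rng.uniform(-M, M, K)
idx = np.clip(((y + A) / (2 * A) * K).astype(int), 0, K - 1)  # cell index of y

# Left side: L2(nu)-norm of the step function f_{b1} - f_{b2}
lhs = np.sqrt(np.sum((b1[idx] - b2[idx])**2 * phi) * dy)

# Right side: weighted Euclidean norm of b1 - b2 with weights nu(I_k)
nu = np.array([np.sum(phi[idx == k]) * dy for k in range(K)])
rhs = np.sqrt(np.sum((b1 - b2)**2 * nu))
```

Both step functions vanish outside $[-A,A)$, so restricting the grid to that interval loses nothing; the two sides agree up to floating-point rounding.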

References

  • Barron, A., M. Schervish, and L. Wasserman. 1999. The consistency of posterior distributions in nonparametric problems. The Annals of Statistics 27 (2):536–61. doi: 10.1214/aos/1018031206.
  • Ghosal, S., and A. W. van der Vaart. 2007. Convergence rates of posterior distributions for noniid observations. The Annals of Statistics 35 (1):192–223. doi:10.1214/009053606000001172.
  • Pollard, D. 1990. Empirical processes: Theory and applications. Hayward, CA: IMS.
  • Shen, X., and L. Wasserman. 2001. Rates of convergence of posterior distributions. The Annals of Statistics 29 (3):687–714. doi:10.1214/aos/1009210686.
  • Walker, S. 2003. On sufficient conditions for Bayesian consistency. Biometrika 90 (2):482–88. doi:10.1093/biomet/90.2.482.
  • Walker, S. 2004. New approaches to Bayesian consistency. The Annals of Statistics 32 (5):2028–43. doi:10.1214/009053604000000409.
  • Walker, S., A. Lijoi, and I. Prunster. 2007. On rates of convergence for posterior distributions in infinite-dimensional models. The Annals of Statistics 35 (2):738–46. doi:10.1214/009053606000001361.
  • Xing, Y. 2008. On adaptive Bayesian inference. Electronic Journal of Statistics 2:848–62.
  • Xing, Y. 2011a. Rates of posterior convergence for iid observations. Communications in Statistics - Theory and Methods 39 (19):3389–98. doi:10.1080/03610920903177389.
  • Xing, Y. 2011b. Convergence rates of nonparametric posterior distributions. Journal of Statistical Planning and Inference 141 (11):3382–90. doi:10.1016/j.jspi.2010.10.009.
  • Xing, Y., and B. Ranneby. 2009. Sufficient conditions for Bayesian consistency. Journal of Statistical Planning and Inference 139 (7):2479–89. doi:10.1016/j.jspi.2008.11.008.

Appendix

Proof of Proposition 1.

Our proof is mainly based on elementary inequalities such as Jensen's inequality and Hölder's inequality. It is no restriction to assume that $n=2k$ is an even number. Take nonempty disjoint subsets $B_j$, $j=1,2,\ldots,N$, of $\Theta$ such that
$$\sum_{j=1}^N\Pi(B_j)^\alpha\le 2\,C\big(\delta\varepsilon,\{\theta\in\Theta_1:\ d(\theta,\theta_0)>\varepsilon\},\alpha,d\big),\qquad\bigcup_{j=1}^N B_j=\{\theta\in\Theta_1:\ d(\theta,\theta_0)>\varepsilon\},$$
and the $d$-diameters of all $B_j$ do not exceed $2\delta\varepsilon$. Then by the inequality $(x+y)^\alpha\le x^\alpha+y^\alpha$ for all $x,y\ge0$, we get
$$P_{\theta_0}^{(n)}\Big(\int_{\{\theta\in\Theta_1:\,d(\theta,\theta_0)>\varepsilon\}}\frac{q_\theta(X_0)}{q_{\theta_0}(X_0)}\prod_{i=1}^{2k}\frac{p_\theta(X_i\mid X_{i-1})}{p_{\theta_0}(X_i\mid X_{i-1})}\,\Pi(d\theta)\Big)^\alpha\le P_{\theta_0}^{(n)}\sum_{j=1}^N\Big(\int_{B_j}\frac{q_\theta(X_0)}{q_{\theta_0}(X_0)}\prod_{i=1}^{2k}\frac{p_\theta(X_i\mid X_{i-1})}{p_{\theta_0}(X_i\mid X_{i-1})}\,\Pi(d\theta)\Big)^\alpha$$
$$=\sum_{j=1}^N\Pi(B_j)^\alpha\,P_{\theta_0}^{(n)}\Big(\frac{1}{\Pi(B_j)}\int_{B_j}\frac{q_\theta(X_0)\prod_{i=1}^{2k}p_\theta(X_i\mid X_{i-1})}{q_{\theta_0}(X_0)\prod_{i=1}^{2k}p_{\theta_0}(X_i\mid X_{i-1})}\,\Pi(d\theta)\Big)^\alpha\le 2\,C\big(\delta\varepsilon,\{\theta\in\Theta_1:\ d(\theta,\theta_0)>\varepsilon\},\alpha,d\big)\,\max_{1\le j\le N}P_{\theta_0}^{(n)}\Big(\frac{1}{\Pi(B_j)}\int_{B_j}\frac{q_\theta(X_0)\prod_{i=1}^{2k}p_\theta(X_i\mid X_{i-1})}{q_{\theta_0}(X_0)\prod_{i=1}^{2k}p_{\theta_0}(X_i\mid X_{i-1})}\,\Pi(d\theta)\Big)^\alpha.$$
We shall use the notation $\prod_{i=1}^0p_\theta(X_i\mid X_{i-1})=1$ and
$$I_{j,s}=\frac{\int_{B_j}q_\theta(X_0)\prod_{i=1}^{s+1}p_\theta(X_i\mid X_{i-1})\,\Pi(d\theta)}{\int_{B_j}q_\theta(X_0)\prod_{i=1}^{s}p_\theta(X_i\mid X_{i-1})\,\Pi(d\theta)}$$
for $s=0,1,\ldots,2k-1$. For simplicity we also let $I_{j,s}$ stand for the parameter of the corresponding integral means. Then the last maximum is equal to
$$\max_{1\le j\le N}P_{\theta_0}^{(n)}\Big(\frac{\int_{B_j}q_\theta(X_0)\,\Pi(d\theta)}{q_{\theta_0}(X_0)\,\Pi(B_j)}\prod_{s=0}^{2k-1}\frac{I_{j,s}}{p_{\theta_0}(X_{s+1}\mid X_s)}\Big)^\alpha=\max_{1\le j\le N}P_{\theta_0}^{(n)}\Big(\Big(\frac{\int_{B_j}q_\theta(X_0)\,\Pi(d\theta)}{q_{\theta_0}(X_0)\,\Pi(B_j)}\prod_{t=1}^{k}\frac{I_{j,2t-1}}{p_{\theta_0}(X_{2t}\mid X_{2t-1})}\Big)^\alpha\Big(\prod_{t=0}^{k-1}\frac{I_{j,2t}}{p_{\theta_0}(X_{2t+1}\mid X_{2t})}\Big)^\alpha\Big),$$
which, by Hölder's inequality, is less than
$$\max_{1\le j\le N}\Big(P_{\theta_0}^{(n)}\Big(\frac{\int_{B_j}q_\theta(X_0)\,\Pi(d\theta)}{q_{\theta_0}(X_0)\,\Pi(B_j)}\prod_{t=1}^{k}\frac{I_{j,2t-1}}{p_{\theta_0}(X_{2t}\mid X_{2t-1})}\Big)^{2\alpha}\Big)^{\frac12}\ \max_{1\le j\le N}\Big(P_{\theta_0}^{(n)}\Big(\prod_{t=0}^{k-1}\frac{I_{j,2t}}{p_{\theta_0}(X_{2t+1}\mid X_{2t})}\Big)^{2\alpha}\Big)^{\frac12}:=\Big(\max_{1\le j\le N}A_{j,k}\Big)\Big(\max_{1\le j\le N}B_{j,k}\Big).$$
Take $\theta_j\in B_j$ for each $j$. From Jensen's inequality and the assumption $a_0\,r(X_s)\le p_\theta(X_s\mid X_{s-1})\le a_1\,r(X_s)$ it turns out that
$$d(I_{j,s},\theta_j)^2=\int_{\mathbb X}\int_{\mathbb X}\Big(\sqrt{I_{j,s}}-\sqrt{p_{\theta_j}(X_{s+1}\mid X_s)}\,\Big)^2\,d\mu(X_{s+1})\,d\nu(X_s)$$
$$\le\int_{B_j}\int_{\mathbb X}\int_{\mathbb X}\Big(\sqrt{p_\theta(X_{s+1}\mid X_s)}-\sqrt{p_{\theta_j}(X_{s+1}\mid X_s)}\,\Big)^2\,d\mu(X_{s+1})\,\frac{q_\theta(X_0)\prod_{i=1}^{s}p_\theta(X_i\mid X_{i-1})}{\int_{B_j}q_\theta(X_0)\prod_{i=1}^{s}p_\theta(X_i\mid X_{i-1})\,\Pi(d\theta)}\,d\nu(X_s)\,\Pi(d\theta)$$
$$\le\frac{a_1}{a_0}\int_{B_j}\int_{\mathbb X}\int_{\mathbb X}\Big(\sqrt{p_\theta(X_{s+1}\mid X_s)}-\sqrt{p_{\theta_j}(X_{s+1}\mid X_s)}\,\Big)^2\,d\mu(X_{s+1})\,d\nu(X_s)\,\frac{q_\theta(X_0)\prod_{i=1}^{s-1}p_\theta(X_i\mid X_{i-1})}{\int_{B_j}q_\theta(X_0)\prod_{i=1}^{s-1}p_\theta(X_i\mid X_{i-1})\,\Pi(d\theta)}\,\Pi(d\theta)$$
$$=\frac{a_1}{a_0}\int_{B_j}d(\theta,\theta_j)^2\,\frac{q_\theta(X_0)\prod_{i=1}^{s-1}p_\theta(X_i\mid X_{i-1})}{\int_{B_j}q_\theta(X_0)\prod_{i=1}^{s-1}p_\theta(X_i\mid X_{i-1})\,\Pi(d\theta)}\,\Pi(d\theta)\le\frac{4a_1\delta^2\varepsilon^2}{a_0}.$$
Thus $d(I_{j,s},\theta_j)\le\frac{2\sqrt{a_1}\,\delta\varepsilon}{\sqrt{a_0}}$ and $d(I_{j,s},\theta_0)\ge d(\theta_j,\theta_0)-d(I_{j,s},\theta_j)\ge\Big(1-\frac{2\sqrt{a_1}\,\delta}{\sqrt{a_0}}\Big)\varepsilon$.
Write
$$A_{j,k}^2=\int_{\mathbb X^{2k-1}}\Big(\int_{\mathbb X}\Big(\int_{\mathbb X}\Big(\frac{I_{j,2k-1}}{p_{\theta_0}(X_{2k}\mid X_{2k-1})}\Big)^{2\alpha}p_{\theta_0}(X_{2k}\mid X_{2k-1})\,d\mu(X_{2k})\Big)\,p_{\theta_0}(X_{2k-1}\mid X_{2k-2})\,d\mu(X_{2k-1})\Big)\Big(\frac{\int_{B_j}q_\theta(X_0)\,\Pi(d\theta)}{q_{\theta_0}(X_0)\,\Pi(B_j)}\prod_{t=1}^{k-1}\frac{I_{j,2t-1}}{p_{\theta_0}(X_{2t}\mid X_{2t-1})}\Big)^{2\alpha}q_{\theta_0}(X_0)\prod_{s=0}^{2k-3}p_{\theta_0}(X_{s+1}\mid X_s)\,d\mu(X_0)\,d\mu(X_1)\cdots d\mu(X_{2k-2}).$$
Take a nonnegative integer $m$ with $\frac{2\alpha}{1-2\alpha}\le 2^m<\frac{4\alpha}{1-2\alpha}$. From Hölder's inequality it turns out that for each $j$ and $k$,
$$\int_{\mathbb X}\Big(\frac{I_{j,2k-1}}{p_{\theta_0}(X_{2k}\mid X_{2k-1})}\Big)^{2\alpha}p_{\theta_0}(X_{2k}\mid X_{2k-1})\,d\mu(X_{2k})\le\Big(\int_{\mathbb X}\Big(\frac{I_{j,2k-1}}{p_{\theta_0}(X_{2k}\mid X_{2k-1})}\Big)^{\frac{2\alpha}{2-2\alpha}}p_{\theta_0}(X_{2k}\mid X_{2k-1})\,d\mu(X_{2k})\Big)^{\frac{2-2\alpha}{2}}\Big(\int_{\mathbb X}I_{j,2k-1}\,d\mu(X_{2k})\Big)^{\alpha}=\Big(\int_{\mathbb X}\Big(\frac{I_{j,2k-1}}{p_{\theta_0}(X_{2k}\mid X_{2k-1})}\Big)^{\frac{2\alpha}{2-2\alpha}}p_{\theta_0}(X_{2k}\mid X_{2k-1})\,d\mu(X_{2k})\Big)^{\frac{2-2\alpha}{2}},$$
which, by repeating the above procedure $m$ more times, does not exceed
$$\Big(\int_{\mathbb X}\Big(\frac{I_{j,2k-1}}{p_{\theta_0}(X_{2k}\mid X_{2k-1})}\Big)^{\frac{\alpha}{2^m(1-2\alpha)+\alpha}}p_{\theta_0}(X_{2k}\mid X_{2k-1})\,d\mu(X_{2k})\Big)^{\frac{2^m(1-2\alpha)+\alpha}{2^m}}\Big(\int_{\mathbb X}\sqrt{I_{j,2k-1}\,p_{\theta_0}(X_{2k}\mid X_{2k-1})}\,d\mu(X_{2k})\Big)^{\frac{\alpha}{2^{m-1}}},$$
where the first factor is at most $1$. Thus we get
$$\int_{\mathbb X}\Big(\int_{\mathbb X}\Big(\frac{I_{j,2k-1}}{p_{\theta_0}(X_{2k}\mid X_{2k-1})}\Big)^{2\alpha}p_{\theta_0}(X_{2k}\mid X_{2k-1})\,d\mu(X_{2k})\Big)p_{\theta_0}(X_{2k-1}\mid X_{2k-2})\,d\mu(X_{2k-1})$$
$$\le\int_{\mathbb X}\Big(1-\frac12\int_{\mathbb X}\Big(\sqrt{I_{j,2k-1}}-\sqrt{p_{\theta_0}(X_{2k}\mid X_{2k-1})}\,\Big)^2\,d\mu(X_{2k})\Big)^{\frac{\alpha}{2^{m-1}}}p_{\theta_0}(X_{2k-1}\mid X_{2k-2})\,d\mu(X_{2k-1})$$
$$\le\Big(1-\frac12\int_{\mathbb X}\int_{\mathbb X}\Big(\sqrt{I_{j,2k-1}}-\sqrt{p_{\theta_0}(X_{2k}\mid X_{2k-1})}\,\Big)^2\,p_{\theta_0}(X_{2k-1}\mid X_{2k-2})\,d\mu(X_{2k})\,d\mu(X_{2k-1})\Big)^{\frac{\alpha}{2^{m-1}}}$$
$$\le\Big(1-\frac{a_0}{2}\int_{\mathbb X}\int_{\mathbb X}\Big(\sqrt{I_{j,2k-1}}-\sqrt{p_{\theta_0}(X_{2k}\mid X_{2k-1})}\,\Big)^2\,d\mu(X_{2k})\,d\nu(X_{2k-1})\Big)^{\frac{1-2\alpha}{2}}=\Big(1-\frac{a_0\,d(I_{j,2k-1},\theta_0)^2}{2}\Big)^{\frac{1-2\alpha}{2}}\le e^{-(1-2\alpha)\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2\varepsilon^2},$$
where the third inequality uses $p_{\theta_0}(X_{2k-1}\mid X_{2k-2})\ge a_0\,r(X_{2k-1})$ together with $\frac{\alpha}{2^{m-1}}\ge\frac{1-2\alpha}{2}$. Hence we have
$$A_{j,k}^2\le e^{-(1-2\alpha)\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2\varepsilon^2}\int_{\mathbb X^{2k-1}}\Big(\frac{\int_{B_j}q_\theta(X_0)\,\Pi(d\theta)}{q_{\theta_0}(X_0)\,\Pi(B_j)}\prod_{t=1}^{k-1}\frac{I_{j,2t-1}}{p_{\theta_0}(X_{2t}\mid X_{2t-1})}\Big)^{2\alpha}q_{\theta_0}(X_0)\prod_{s=0}^{2k-3}p_{\theta_0}(X_{s+1}\mid X_s)\,d\mu(X_0)\,d\mu(X_1)\cdots d\mu(X_{2k-2}).$$
Repeating the same argument $k-1$ times one can get
$$A_{j,k}^2\le e^{-(1-2\alpha)\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2k\varepsilon^2}\int_{\mathbb X}\Big(\frac{\int_{B_j}q_\theta(X_0)\,\Pi(d\theta)}{q_{\theta_0}(X_0)\,\Pi(B_j)}\Big)^{2\alpha}q_{\theta_0}(X_0)\,d\mu(X_0)\le e^{-(1-2\alpha)\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2k\varepsilon^2}\Big(\int_{\mathbb X}\frac{\int_{B_j}q_\theta(X_0)\,\Pi(d\theta)}{\Pi(B_j)}\,d\mu(X_0)\Big)^{2\alpha}=e^{-(1-2\alpha)\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2k\varepsilon^2}.$$
Similarly, we have $B_{j,k}^2\le e^{-(1-2\alpha)\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2k\varepsilon^2}$. Since $n=2k$, it follows that
$$\Big(\max_{1\le j\le N}A_{j,k}\Big)\Big(\max_{1\le j\le N}B_{j,k}\Big)\le e^{-(1-2\alpha)\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2k\varepsilon^2}=e^{-\frac{1-2\alpha}{2}\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2n\varepsilon^2}.$$
Hence we have proved the required inequality and the proof of Proposition 1 is complete. □

To prove Theorem 1 we need the following inequality.

Lemma 1.

If there exists a constant $a_1\ge1$ such that $\int_A p_{\theta_0}(y\mid x)\,d\mu(y)\le a_1\int_A d\nu(y)$ for all $x\in\mathbb X$ and $A\in\mathcal A$, then the inequality
$$P_{\theta_0}^{(n)}\left(\int_\Theta\frac{q_\theta(X_0)}{q_{\theta_0}(X_0)}\prod_{i=1}^n\frac{p_\theta(X_i\mid X_{i-1})}{p_{\theta_0}(X_i\mid X_{i-1})}\,\Pi(d\theta)\le e^{-n\varepsilon^2(3a_1+4c)}\,\Pi\big(W_{n1}(\theta_0,\varepsilon)\big)\right)\le e^{-n\varepsilon^2c}$$
holds for all $n$, $\varepsilon>0$ and $c>0$.

Proof of Lemma 1.

Without loss of generality, we may assume that $\Pi(W_{n1}(\theta_0,\varepsilon))>0$. From Jensen's inequality and Chebyshev's inequality it follows that
$$P_{\theta_0}^{(n)}\left(\int_\Theta\frac{q_\theta(X_0)}{q_{\theta_0}(X_0)}\prod_{i=1}^n\frac{p_\theta(X_i\mid X_{i-1})}{p_{\theta_0}(X_i\mid X_{i-1})}\,\Pi(d\theta)\le e^{-n\varepsilon^2(3a_1+4c)}\,\Pi\big(W_{n1}(\theta_0,\varepsilon)\big)\right)$$
$$\le P_{\theta_0}^{(n)}\left(e^{n\varepsilon^2\left(\frac{3a_1}{4}+c\right)}\le\Big(\frac{1}{\Pi(W_{n1}(\theta_0,\varepsilon))}\int_{W_{n1}(\theta_0,\varepsilon)}\frac{q_\theta(X_0)}{q_{\theta_0}(X_0)}\prod_{i=1}^n\frac{p_\theta(X_i\mid X_{i-1})}{p_{\theta_0}(X_i\mid X_{i-1})}\,\Pi(d\theta)\Big)^{-\frac14}\right)$$
$$\le P_{\theta_0}^{(n)}\left(e^{n\varepsilon^2\left(\frac{3a_1}{4}+c\right)}\le\frac{1}{\Pi(W_{n1}(\theta_0,\varepsilon))}\int_{W_{n1}(\theta_0,\varepsilon)}\Big(\frac{q_\theta(X_0)}{q_{\theta_0}(X_0)}\prod_{i=1}^n\frac{p_\theta(X_i\mid X_{i-1})}{p_{\theta_0}(X_i\mid X_{i-1})}\Big)^{-\frac14}\,\Pi(d\theta)\right)$$
$$\le\frac{\int_{W_{n1}(\theta_0,\varepsilon)}P_{\theta_0}^{(n)}\Big(\frac{q_{\theta_0}(X_0)}{q_\theta(X_0)}\prod_{i=1}^n\frac{p_{\theta_0}(X_i\mid X_{i-1})}{p_\theta(X_i\mid X_{i-1})}\Big)^{\frac14}\,\Pi(d\theta)}{e^{n\varepsilon^2\left(\frac{3a_1}{4}+c\right)}\,\Pi\big(W_{n1}(\theta_0,\varepsilon)\big)}.$$
So it suffices to prove that $$P_{\theta_0}^{(n)}\Big(\frac{q_{\theta_0}(X_0)}{q_\theta(X_0)}\prod_{i=1}^n\frac{p_{\theta_0}(X_i\mid X_{i-1})}{p_\theta(X_i\mid X_{i-1})}\Big)^{\frac14}\le e^{\frac{3a_1}{4}n\varepsilon^2}$$ for all $\theta\in W_{n1}(\theta_0,\varepsilon)$. We assume without loss of generality that $n$ is an even number, say $n=2k$. Write
$$\frac{q_{\theta_0}(X_0)}{q_\theta(X_0)}\prod_{i=1}^n\frac{p_{\theta_0}(X_i\mid X_{i-1})}{p_\theta(X_i\mid X_{i-1})}=\frac{q_{\theta_0}(X_0)}{q_\theta(X_0)}\prod_{j=1}^k\frac{p_{\theta_0}(X_{2j}\mid X_{2j-1})}{p_\theta(X_{2j}\mid X_{2j-1})}\prod_{j=1}^k\frac{p_{\theta_0}(X_{2j-1}\mid X_{2j-2})}{p_\theta(X_{2j-1}\mid X_{2j-2})}.$$
From Hölder's inequality it then turns out that
$$P_{\theta_0}^{(n)}\Big(\frac{q_{\theta_0}(X_0)}{q_\theta(X_0)}\prod_{i=1}^n\frac{p_{\theta_0}(X_i\mid X_{i-1})}{p_\theta(X_i\mid X_{i-1})}\Big)^{\frac14}\le\Big(P_{\theta_0}^{(n)}\Big(\frac{q_{\theta_0}(X_0)}{q_\theta(X_0)}\prod_{j=1}^k\frac{p_{\theta_0}(X_{2j}\mid X_{2j-1})}{p_\theta(X_{2j}\mid X_{2j-1})}\Big)^{\frac12}\Big)^{\frac12}\Big(P_{\theta_0}^{(n)}\Big(\prod_{j=1}^k\frac{p_{\theta_0}(X_{2j-1}\mid X_{2j-2})}{p_\theta(X_{2j-1}\mid X_{2j-2})}\Big)^{\frac12}\Big)^{\frac12}:=A_kB_k.$$
Hence by Fubini's theorem, $A_k^2$ is equal to
$$\int_{\mathbb X^{2k+1}}\frac{q_{\theta_0}(X_0)^{\frac32}}{q_\theta(X_0)^{\frac12}}\prod_{j=1}^k\Big(\frac{p_{\theta_0}(X_{2j}\mid X_{2j-1})^{\frac32}}{p_\theta(X_{2j}\mid X_{2j-1})^{\frac12}}\,p_{\theta_0}(X_{2j-1}\mid X_{2j-2})\Big)\,d\mu(X_0)\,d\mu(X_1)\cdots d\mu(X_{2k})$$
$$=\int_{\mathbb X^{2k-1}}\Big(\int_{\mathbb X}\Big(\int_{\mathbb X}\frac{p_{\theta_0}(X_{2k}\mid X_{2k-1})^{\frac32}}{p_\theta(X_{2k}\mid X_{2k-1})^{\frac12}}\,d\mu(X_{2k})\Big)\,p_{\theta_0}(X_{2k-1}\mid X_{2k-2})\,d\mu(X_{2k-1})\Big)\frac{q_{\theta_0}(X_0)^{\frac32}}{q_\theta(X_0)^{\frac12}}\prod_{j=1}^{k-1}\frac{p_{\theta_0}(X_{2j}\mid X_{2j-1})^{\frac32}}{p_\theta(X_{2j}\mid X_{2j-1})^{\frac12}}\,p_{\theta_0}(X_{2j-1}\mid X_{2j-2})\,d\mu(X_0)\,d\mu(X_1)\cdots d\mu(X_{2k-2}),$$
where by the proof of Lemma 1 in Xing (Citation2011b) we have
$$\int_{\mathbb X}\Big(\int_{\mathbb X}\frac{p_{\theta_0}(X_{2k}\mid X_{2k-1})^{\frac32}}{p_\theta(X_{2k}\mid X_{2k-1})^{\frac12}}\,d\mu(X_{2k})\Big)\,p_{\theta_0}(X_{2k-1}\mid X_{2k-2})\,d\mu(X_{2k-1})=\int_{\mathbb X}\Big(1+\frac32\,H_*\big(p_{\theta_0}(\cdot\mid X_{2k-1}),p_\theta(\cdot\mid X_{2k-1})\big)^2\Big)\,p_{\theta_0}(X_{2k-1}\mid X_{2k-2})\,d\mu(X_{2k-1})$$
$$=1+\int_{\mathbb X}\frac32\,H_*\big(p_{\theta_0}(\cdot\mid X_{2k-1}),p_\theta(\cdot\mid X_{2k-1})\big)^2\,p_{\theta_0}(X_{2k-1}\mid X_{2k-2})\,d\mu(X_{2k-1})\le1+\frac{3a_1}{2}\int_{\mathbb X}H_*\big(p_{\theta_0}(\cdot\mid X_{2k-1}),p_\theta(\cdot\mid X_{2k-1})\big)^2\,d\nu(X_{2k-1})=1+\frac{3a_1}{2}H_*(p_{\theta_0},p_\theta)^2\le e^{\frac{3a_1}{2}H_*(p_{\theta_0},p_\theta)^2}.$$
Thus we have obtained $A_k\le e^{\frac{3a_1}{4}H_*(p_{\theta_0},p_\theta)^2}A_{k-1}$. Repeating the same argument $k-1$ times and using $a_1\ge1$ one can get
$$A_k\le e^{\frac{3a_1}{4}kH_*(p_{\theta_0},p_\theta)^2}\Big(\int_{\mathbb X}\frac{q_{\theta_0}(X_0)^{\frac32}}{q_\theta(X_0)^{\frac12}}\,d\mu(X_0)\Big)^{\frac12}=e^{\frac{3a_1}{4}kH_*(p_{\theta_0},p_\theta)^2}\Big(1+\frac32\,H_*(q_{\theta_0},q_\theta)^2\Big)^{\frac12}\le e^{\frac34H_*(q_{\theta_0},q_\theta)^2+\frac{3a_1}{4}kH_*(p_{\theta_0},p_\theta)^2}.$$
Similarly, we can get $B_k\le e^{\frac{3a_1}{4}kH_*(p_{\theta_0},p_\theta)^2}$. Therefore
$$A_kB_k\le e^{\frac34H_*(q_{\theta_0},q_\theta)^2+\frac{3a_1}{4}nH_*(p_{\theta_0},p_\theta)^2}\le e^{\frac{3a_1}{4}n\varepsilon^2}$$
for all $\theta\in W_{n1}(\theta_0,\varepsilon)$, and the proof of Lemma 1 is complete. □

Proof of Theorem 1.

Take a constant $c$ such that $c_2>c>1/c_0$. Then $e^{-n\varepsilon_n^2c}\le e^{-cc_0\log n}=n^{-cc_0}$ and hence $\sum_{n=1}^\infty e^{-n\varepsilon_n^2c}<\infty$. By Lemma 1 and the first Borel–Cantelli lemma, we get that for almost all sample paths $\{X_0,X_1,\ldots\}$ the inequality
$$\int_\Theta\frac{q_\theta(X_0)}{q_{\theta_0}(X_0)}\prod_{i=1}^n\frac{p_\theta(X_i\mid X_{i-1})}{p_{\theta_0}(X_i\mid X_{i-1})}\,\Pi(d\theta)\ge e^{-n\varepsilon_n^2(3a_1+4c)}\,\Pi\big(W_{n1}(\theta_0,\varepsilon_n)\big)$$
holds for all large $n$. But
$$P_{\theta_0}^{(n)}\Big(\Pi\big(\theta\in\Theta:\ d(\theta,\theta_0)\ge r\,\varepsilon_n\mid X_0,X_1,\ldots,X_n\big)\ge e^{-bn\varepsilon_n^2}\Big)\le P_{\theta_0}^{(n)}\Big(e^{bn\varepsilon_n^2}\,\Pi\big(\theta\in\Theta\setminus\Theta_n:\ d(\theta,\theta_0)\ge r\,\varepsilon_n\mid X_0,X_1,\ldots,X_n\big)\ge\frac12\Big)+P_{\theta_0}^{(n)}\Big(e^{bn\varepsilon_n^2}\,\Pi\big(\theta\in\Theta_n:\ d(\theta,\theta_0)\ge r\,\varepsilon_n\mid X_0,X_1,\ldots,X_n\big)\ge\frac12\Big):=a_n+b_n.$$
Hence, for large $n$ we have
$$a_n\le 2\,e^{bn\varepsilon_n^2}\,P_{\theta_0}^{(n)}\,\Pi\big(\theta\in\Theta\setminus\Theta_n:\ d(\theta,\theta_0)\ge r\,\varepsilon_n\mid X_0,X_1,\ldots,X_n\big)\le\frac{2\,e^{n\varepsilon_n^2(b+3a_1+4c)}}{\Pi\big(W_{n1}(\theta_0,\varepsilon_n)\big)}\,P_{\theta_0}^{(n)}\int_{\{\theta\in\Theta\setminus\Theta_n:\,d(\theta,\theta_0)\ge r\varepsilon_n\}}\frac{q_\theta(X_0)}{q_{\theta_0}(X_0)}\prod_{i=1}^n\frac{p_\theta(X_i\mid X_{i-1})}{p_{\theta_0}(X_i\mid X_{i-1})}\,\Pi(d\theta)\le\frac{2\,e^{n\varepsilon_n^2(b+3a_1+4c)}\,\Pi(\Theta\setminus\Theta_n)}{\Pi\big(W_{n1}(\theta_0,\varepsilon_n)\big)},$$
which, by the assumption of Theorem 1, implies that $\sum_{n=1}^\infty a_n<\infty$ if the constant $b$ is small enough.

On the other hand, let $D_j=\{\theta\in\Theta_n:\ j\varepsilon_n<d(\theta,\theta_0)\le 2j\varepsilon_n\}$ and let $[r]$ be the largest integer less than or equal to the constant $r$. Then, by the inequality $(x+y)^\alpha\le x^\alpha+y^\alpha$ for all $x,y\ge0$ and $0<\alpha<1$, we get
$$b_n=P_{\theta_0}^{(n)}\Big(2^\alpha e^{\alpha bn\varepsilon_n^2}\,\Pi\big(\theta\in\Theta_n:\ d(\theta,\theta_0)\ge r\,\varepsilon_n\mid X_0,X_1,\ldots,X_n\big)^\alpha\ge1\Big)\le 2^\alpha e^{\alpha bn\varepsilon_n^2}\,P_{\theta_0}^{(n)}\,\Pi\big(\theta\in\Theta_n:\ d(\theta,\theta_0)\ge r\,\varepsilon_n\mid X_0,X_1,\ldots,X_n\big)^\alpha$$
$$\le 2^\alpha e^{\alpha bn\varepsilon_n^2}\sum_{j=[r]}^\infty P_{\theta_0}^{(n)}\,\Pi\big(D_j\mid X_0,X_1,\ldots,X_n\big)^\alpha\le\frac{2^\alpha e^{\alpha n\varepsilon_n^2(b+3a_1+4c)}}{\Pi\big(W_{n1}(\theta_0,\varepsilon_n)\big)^\alpha}\sum_{j=[r]}^\infty P_{\theta_0}^{(n)}\Big(\int_{D_j}\frac{q_\theta(X_0)}{q_{\theta_0}(X_0)}\prod_{i=1}^n\frac{p_\theta(X_i\mid X_{i-1})}{p_{\theta_0}(X_i\mid X_{i-1})}\,\Pi(d\theta)\Big)^\alpha,$$
which, by Proposition 1 and the entropy assumption of Theorem 1, does not exceed
$$\frac{2^\alpha e^{\alpha n\varepsilon_n^2(b+3a_1+4c)}}{\Pi\big(W_{n1}(\theta_0,\varepsilon_n)\big)^\alpha}\sum_{j=[r]}^\infty 2\,e^{-\frac{1-2\alpha}{2}\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2n(j\varepsilon_n)^2}\,e^{c_1j^2n\varepsilon_n^2}\,\Pi\big(W_{n1}(\theta_0,\varepsilon_n)\big)^\alpha=2^{1+\alpha}e^{\alpha n\varepsilon_n^2(b+3a_1+4c)}\sum_{j=[r]}^\infty e^{\left(c_1-\frac{1-2\alpha}{2}\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2\right)n\varepsilon_n^2j^2}$$
$$\le 2^{1+\alpha}e^{\alpha n\varepsilon_n^2(b+3a_1+4c)}\sum_{j=[r]}^\infty e^{\left(c_1-\frac{1-2\alpha}{2}\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2\right)n\varepsilon_n^2j}=\frac{2^{1+\alpha}e^{\alpha n\varepsilon_n^2(b+3a_1+4c)+\left(c_1-\frac{1-2\alpha}{2}\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2\right)n\varepsilon_n^2[r]}}{1-e^{\left(c_1-\frac{1-2\alpha}{2}\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2\right)n\varepsilon_n^2}}$$
$$\le\frac{2^{1+\alpha}\,n^{c_0\alpha(b+3a_1+4c)+c_0\left(c_1-\frac{1-2\alpha}{2}\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2\right)[r]}}{1-n^{\left(c_1-\frac{1-2\alpha}{2}\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2\right)c_0}}\le 2^{2+\alpha}\,n^{c_0\alpha(b+3a_1+4c)+c_0\left(c_1-\frac{1-2\alpha}{2}\left(\frac{\sqrt{a_0}}{2}-\sqrt{a_1}\,\delta\right)^2\right)[r]},$$
where the next-to-last inequality holds for all large $r$ and the last inequality holds for all large $n$. Since the last exponent of $n$ is strictly less than $-1$ if $r$ is large enough, we obtain that $\sum_{n=1}^\infty b_n<\infty$ if the constant $r$ is large enough. Hence, by the first Borel–Cantelli lemma we obtain that for almost all sample paths $X_0,X_1,\ldots$, $$\Pi\big(\theta\in\Theta:\ d(\theta,\theta_0)\ge r\,\varepsilon_n\mid X_0,X_1,\ldots,X_n\big)\le e^{-bn\varepsilon_n^2}$$ if $n$ is large enough. The proof of Theorem 1 is complete. □