Robust oracle estimation and uncertainty quantification for possibly sparse quantiles

Pages 60-77 | Received 01 Oct 2022, Accepted 10 Jun 2023, Published online: 22 Jun 2023

Abstract

A general 'many quantiles + noise' model is studied in the robust formulation (allowing non-normal, non-independent observations), where the identifiability requirement for the noise is formulated in terms of quantiles rather than the traditional zero-expectation assumption. We propose a penalisation method based on the quantile loss function with an appropriately chosen penalty function for making inference on a possibly sparse high-dimensional quantile vector. We apply a local approach, addressing optimality by comparing procedures to the oracle sparsity structure. We establish that the proposed procedure mimics the oracle in the problems of estimation and uncertainty quantification (under the so-called EBR condition). Adaptive minimax results over the sparsity scale follow from our local results.

1. Introduction

Since the emergence of statistical science, the classical 'signal + noise' paradigm for observed data has been the main testbed in theoretical statistics for a vast number of methods and techniques, for various inference problems and in various optimality frameworks. Many other practically important situations can be reduced to or approximated by the 'signal + noise' setting, capturing the statistical essence of the original (typically more complex) model and preserving its main features in a pure form. There is a huge literature on the 'signal + noise' setting with a wide variety of combinations of the main ingredients in the study. We mention the following ingredients: assumptions on the observation model (moment conditions, independence, normality, etc.); applied methodology (LSE, penalisation, shrinkage, thresholding, (empirical) Bayesian approach, projection, FDR method); studied inference problem (estimation, detection, model selection, testing, posterior contraction, uncertainty quantification, structure recovery); structural assumptions (smoothness, sparsity, clustering, shape constraints such as monotonicity, unimodality and convexity); loss functions (expectations of powers of the $\ell_2$-norm or $\ell_q$-norm, Hamming loss, etc.); optimality framework and pursued criteria (minimax rate, oracle rate, control of FDR, FNR, expectation of the loss function, exponential bounds for large deviations of the loss function, etc.). A small sample of relevant literature includes (Donoho, Johnstone, Hoch, and Stern 1992; Donoho and Johnstone 1994; Benjamini and Hochberg 1995; Birgé and Massart 2001; Johnstone and Silverman 2004; Baraud 2004; Abramovich, Benjamini, Donoho, and Johnstone 2006; Efron 2008; Babenko and Belitser 2010; Belloni, Chernozhukov, and Wang 2011; Castillo and van der Vaart 2012; Martin and Walker 2014; Johnstone 2017; Belitser 2017; Bellec 2018; Butucea, Ndaoud, Stepanova, and Tsybakov 2018; Belitser and Nurushev 2020).

Suppose we observe $X=(X_1,\dots,X_n)\sim P_\theta^n$, where $\theta=(\theta_1,\dots,\theta_n)\in\mathbb{R}^n$ is the parameter of interest. We shall suppress $n$ from the notation and simply write $P_\theta$ for $P_\theta^n$. A natural example is given by $X_i\overset{\mathrm{ind}}{\sim}N(\theta_i,1)$, $i=1,\dots,n$, but this specific distributional assumption on the data generating process will not be imposed. The main modelling assumption in this note is that, for some fixed level $\tau\in(0,1)$, $P(X_i-\theta_i\le 0)=\tau$, $i=1,\dots,n$. In other words, we deal with the setting 'many quantiles + noise', as the $\theta_i$'s are the $\tau$-quantiles of the observed $X_i$'s, and the goal is to make inference on the high-dimensional parameter $\theta$. One can study and make inference on several quantiles simultaneously across different levels $\tau\in(0,1)$. While the mean is a commonly used measure of centrality, it is heavily influenced by outliers or extreme values in the dataset. In contrast, quantiles are robust to outliers and provide information about the spread of the data as well as the location of the central values; cf. Zou and Yuan (2008), Jiang, Wang, and Bondell (2013) and Jiang, Bondell, and Wang (2014). The setting 'many quantiles + noise' with sparse quantile vectors arises in statistical and machine learning applications where the goal is to estimate a large number of quantiles for a high-dimensional dataset. The number of quantiles to be estimated can be very large, while the data may be sparse, meaning that only a small fraction of the entries in the data are non-zero. In such applications, quantiles may be a more sensible object of study, especially when the distribution is skewed or contains outliers. In this setting, the interest focuses on the effect of the parameter values on the tails of the distribution of observations, in addition to, or instead of, the centre. For example, in finance, quantiles such as the value at risk (in finance, the value at risk is precisely a quantile) or the expected shortfall are commonly used to estimate the potential losses of a portfolio of investments, rather than relying solely on the mean return. Estimating a large number of quantiles is important for risk management and portfolio optimisation, as it provides a more complete picture of the potential losses or gains of a portfolio. In genomics, estimating many quantiles can help identify genes or markers that are associated with disease. In image processing, estimating quantiles can help identify regions of interest or anomalies in images. Quantiles as objects of interest occur in diverse areas, including economics, biology and meteorology; cf. Santosa and Symes (1986), Abrevaya (2002), Koenker (2005), Belloni and Chernozhukov (2011), Hulmán et al. (2015), Ciuperca (2018) and Belloni, Chernozhukov, and Kato (2019).

Besides the quantile formulation, the next distinctive feature of our study is the robust formulation, in the sense that we do not assume any particular form of the distribution of the observed data. Only a mild condition (Condition C1) is imposed. The observations do not need to be normal and, in fact, no specific distribution is assumed: the distribution of the 'noise' terms $X_i-\theta_i$, $i=1,\dots,n$, may depend on $\theta$, and the noise terms do not even have to be independent. This makes the applicability scope of our results rather broad.

The third aspect of our study is the local approach. It is in general impossible to make sensible inference on a high-dimensional parameter without any additional structure, either in terms of assumptions on the parameter, or on the observation model, or both. In this paper, we are concerned with the sparsity structure, various versions of which have been predominantly studied in the recent literature. Sparsity structure means that the vast majority of the parameter entries are all equal to some fixed value (typically zero). In the local approach, we address optimality by comparing procedures to the oracle sparsity structure (think of an 'oracle observer' that knows the best sparsity pattern of the actual $\theta$). When applying the local approach, the idea is not to rely on sparsity as such, but rather to extract all the sparsity structure present in the data and utilise it in the targeted inference problems. The claim that a procedure performs optimally in the local sense means that it attains the oracle quality, i.e. mimics the best sparsity structure pertinent to the true parameter, whichever it is. The local results imply minimax optimality over all scales (including the traditional sparsity class $\ell_0[s]$) that satisfy a certain condition.

The final feature of our study concerns the problem of uncertainty quantification. In recent years, the focus in nonparametric statistics has shifted from point estimation to uncertainty quantification, a more challenging problem for which far fewer results are available in the literature; cf. Szabó, van der Vaart, and van Zanten (2015), Belitser (2017), van der Pas, Szabó, and van der Vaart (2017) and Belitser and Nurushev (2020). Certain negative results (cf. Li 1989; Baraud 2004; Belitser and Nurushev 2019 and the references therein) show that in general a confidence set cannot simultaneously have coverage and the optimal size uniformly over all parameter values. This makes the construction of so-called 'honest' confidence sets impossible, and a strategy recently pursued in the literature is to discard a set of 'deceptive' parameters in order to ensure coverage at the remaining parameter values, while maintaining the optimal size uniformly over the whole set. In increasing order of generality, removing deceptive parameters is expressed by imposing the conditions of self-similarity, polished tail and excessive bias restriction (EBR); see Szabó et al. (2015), Belitser (2017), van der Pas et al. (2017) and Belitser and Nurushev (2020). All the above papers deal with the $\ell_2$-loss framework for Gaussian observations in the 'many means + noise' setting (a more general case is considered in Belitser and Nurushev 2020).

In this paper, we work with the new setting 'many quantiles + general noise' and use the quantile loss rather than the traditional $\ell_2$-loss. The quantile loss function has certain properties of usual loss functions, but it is not symmetric, the peculiarity essentially characterising the notion of quantile. We propose and exploit a new version of the EBR condition expressed in terms of the quantile loss. To the best of our knowledge, there are no results on uncertainty quantification either for the quantile loss function or for the $\ell_1$-loss, in either local or global (minimax) formulation, in this general robust setting. In this respect, our study presents the first step in this direction.

We finally summarise the main contributions of this paper.

  • For our new (robust) setting 'many quantiles + general noise' we propose a penalisation procedure for selecting the sparsity pattern, based on the quantile loss (as a counterpart of the $\ell_2$-loss) and the corresponding penalty term.

  • The estimated sparsity pattern is next used for the construction of an estimator of θ and a confidence set for θ, again in terms of the quantile loss function.

  • Two theorems establish the local (oracle) optimality of the estimator and confidence set (under the EBR condition), respectively.

  • We provide elegant and relatively short proofs of the main results; compare with the relatively laborious proofs of related results in Szabó et al. (2015), Belitser (2017), van der Pas et al. (2017) and Belitser and Nurushev (2020).

  • The obtained results are robust and local in the sense explained above.

  • The obtained local results imply adaptive minimax optimality for scales of classes that satisfy a certain relation (in particular, for the traditional sparsity scale).

The organisation of the rest of the paper is as follows. In Section 2, we give a robust formulation of the observation model in the 'many quantiles + general noise' setting and introduce some notation and preliminaries. In Section 3, we present the main results of the paper and discuss their consequences for optimality in the minimax sense over the traditional sparsity scale. Section 4 contains a small simulation study. The proofs of the theorems are provided in Section 5.

2. Robust model formulation and preliminaries

We can formally rewrite the model stated in the introduction in the familiar 'signal + noise' form:

(1) $X_i=\theta_i+\xi_i,\qquad P_\theta(\xi_i\le 0)=\tau,\qquad i\in[n]=\{1,\dots,n\},$

where $\xi_i=X_i-\theta_i$ can be thought of as 'noise'. Note that all the quantities depend of course on $\tau$: precisely, for $\tau\in(0,1)$, in the model (1) we have $\theta=\theta(\tau)$, $\xi=\xi(\tau)$ with $P_\theta(\xi_i(\tau)\le 0)=\tau$, $i=1,\dots,n$. Not to overload the notation, we suppress the dependence on $\tau$ in the sequel. As mentioned in the Introduction, we study a robust setting in the sense that we do not assume any particular distribution of $\xi=(\xi_1,\dots,\xi_n)$. In fact, the $\xi_i$'s do not have to be identically distributed, their distribution may depend on $\theta$, and they do not even have to be independent.

Introduce the following asymmetric absolute deviation function:
$$\rho(x)=\rho_\tau(x)=\rho_{\tau,m}(x)=\sum_{i=1}^m x_i\bigl(\tau-\mathbb{1}\{x_i\le 0\}\bigr),\qquad x\in\mathbb{R}^m,\ m\in\mathbb{N},$$
which we will also call the quantile loss function; here $\mathbb{1}\{E\}$ denotes the indicator of the event $E$.

Slightly abusing notation, we use the same symbol $\rho(x)$ for any $m\in\mathbb{N}$; for example, most of the time it will be either $m=n$ or $m=1$, depending on the dimensionality of the argument of the function. This will always be clear from the context. Often we will suppress the dependence of $\rho$ on the fixed $\tau\in(0,1)$, unless emphasising this dependence is relevant.
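As a small illustration (our own Python/NumPy sketch, not part of the paper), the quantile loss is straightforward to compute for a scalar or vector argument and a fixed level $\tau$:

import numpy as np

def quantile_loss(x, tau):
    """Quantile loss rho_tau(x) = sum_i x_i * (tau - 1{x_i <= 0}); works for scalars and vectors."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return float(np.sum(x * (tau - (x <= 0.0))))

# For m = 1 the loss equals tau*x for x > 0 and (1 - tau)*|x| for x <= 0, e.g.
# quantile_loss(2.0, 0.25) == 0.5 and quantile_loss(-2.0, 0.25) == 1.5.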

Remark 2.1

The origin of the quantile loss function $\rho_\tau(x)$, $x\in\mathbb{R}$, is well explained in the book by Koenker (2005). The function $\rho_\tau(x)$ characterises the $\tau$-quantile of a random variable $Z$ in the sense that the $\tau$-quantile $\theta_\tau$ of $Z$ is known to minimise the criterion function $E\rho_\tau(Z-\vartheta)$ with respect to $\vartheta$, i.e. $\arg\min_{\vartheta\in\mathbb{R}}E\rho_\tau(Z-\vartheta)=\theta_\tau$. Basically, this function plays the same role in characterising quantiles as the quadratic function does in characterising the expectation of a random variable.
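A quick numerical check of this characterisation (an illustrative sketch assuming, say, $Z\sim N(0,1)$; the grid search is only for demonstration) minimises the empirical criterion and compares the minimiser with the empirical $\tau$-quantile:

import numpy as np

rng = np.random.default_rng(1)
tau = 0.75
Z = rng.standard_normal(200_000)                        # sample from Z ~ N(0, 1)
grid = np.linspace(-2.0, 2.0, 801)                      # candidate values of vartheta
crit = [np.mean((Z - v) * (tau - ((Z - v) <= 0))) for v in grid]
print(grid[np.argmin(crit)], np.quantile(Z, tau))       # both are close to the 0.75-quantile of N(0,1), about 0.6745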

The quantile loss function $\rho$ can be related to the $\ell_1$-criterion $|x|=\sum_{i=1}^m|x_i|$, $x=(x_1,\dots,x_m)\in\mathbb{R}^m$: for any $\tau\in(0,1)$,

(2) $(1-c_\tau)|x|\le\rho_\tau(x)\le c_\tau|x|,\qquad\text{with } c_\tau=\max\{\tau,1-\tau\}.$

In particular, it follows that $\rho_{1/2}(x)=|x|/2$. Another useful property of the quantile loss function $\rho$, to be used later on, is that, for any $x\in\mathbb{R}^m$,

(3) $\rho_\tau(-x)\le C_\tau\rho_\tau(x),\qquad\text{with } C_\tau=\max\Bigl\{\frac{1-\tau}{\tau},\frac{\tau}{1-\tau}\Bigr\}=\frac{c_\tau}{1-c_\tau}.$

Remark 2.2

The above quantile loss function evaluated at a difference, $\rho(x-y)$, $x,y\in\mathbb{R}^m$, possesses all the properties of a metric $d(x,y)$ except for the symmetry. In particular, $\rho(0)=0$ (zero at zero); if $x\ne y$, then $\rho(x-y)>0$ (positivity); and finally the triangle inequality holds:

(4) $\rho(x+y)\le\rho(x)+\rho(y),\qquad x,y\in\mathbb{R}^m.$

Since we have as many parameters as observations, it is typically impossible to make inference on $\theta$ even in a weak sense unless the data possess some structure. Here, we work with the structural assumption that $\theta$ is (possibly) a sparse vector. Specifically, we assume that $\theta\in L_I$, where $I\subseteq[n]=\{1,\dots,n\}$ and $L_I=\{x\in\mathbb{R}^n:\ x_i=0,\ i\in I^c\}$, with $I^c=[n]\setminus I$. The structure studied here is the unknown 'true' sparsity pattern $I_0=I_0(\theta)\subseteq[n]$, that is, $\theta\in L_{I_0}$ and $I_0$ is the 'minimal' sparsity structure in the sense that $I_0=\arg\min_{I\subseteq[n]:\,\theta\in L_I}|I|$.

Introduce some further notation: let $\mathcal{I}$ denote the family of all subsets of $[n]$; for $x\in\mathbb{R}^n$, denote its $\ell_1$-norm by $|x|=\sum_{i=1}^n|x_i|$; let $|S|$ denote the cardinality of a set $S$. Further, let us introduce the so-called 'quantile projection' $P_I$ onto $L_I$ (called just projection in what follows), with respect to the quantile loss function $\rho(x-y)$. For $x\in\mathbb{R}^n$, define $P_Ix$ as follows: $\inf_{y\in L_I}\rho(x-y)=\rho(x-P_Ix)$. In view of the properties of the quantile loss function $\rho$ given by Remark 2.2, the projection $P_I$ is readily found as the following linear operator: $P_Ix=(x_i\mathbb{1}\{i\in I\},\ i=1,\dots,n)\in\mathbb{R}^n$. We will later use a certain monotonicity property of the quantile projection operator:

(5) if $I_1\subseteq I_2$, then $\rho(P_{I_1}x)\le\rho(P_{I_2}x)$, $x\in\mathbb{R}^n$.

For $s\ge 0$ and $I\in\mathcal{I}$, denote (with the convention $0\log(a/0)=0$ for $a>0$)
$$\lambda(s)=s\log(en/s),\qquad p(I)=\bigl(|I|\,\lambda(|I|)\bigr)^{1/2}=|I|\,[\log(en/|I|)]^{1/2}.$$
We have that $\lambda(s)\ge s$ for $s\in[0,n]$ and, besides, since $\lambda(s)$ is increasing in $s\in(0,n]$, $\lambda(s)\ge 1+\log(n)$ for all $s\in[n]$. The function $p(I)$ has the meaning of the complexity of the structure $I$.
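For later reference, a small Python sketch (the helper names are ours) of the complexity function $p(I)$ and the quantile projection $P_I$ could be:

import numpy as np

def lam(s, n):
    """lambda(s) = s * log(e*n/s), with the convention 0 * log(a/0) = 0."""
    return 0.0 if s == 0 else s * np.log(np.e * n / s)

def complexity(size_I, n):
    """p(I) = (|I| * lambda(|I|))^(1/2) = |I| * sqrt(log(e*n/|I|)); depends on I only through |I|."""
    return float(np.sqrt(size_I * lam(size_I, n)))

def project(x, I):
    """Quantile projection P_I x: keep the coordinates in I, set the others to zero."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    idx = list(I)
    out[idx] = x[idx]
    return out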

The following relation will be used later: for any $\nu>1$,

(6) $\sum_{I\in\mathcal{I}}e^{-\nu\lambda(|I|)}=1+\sum_{s\in[n]}\sum_{I:\,|I|=s}e^{-\nu\lambda(s)}\le 1+\sum_{s\in[n]}e^{-(\nu-1)s}\le C_\nu,$

with $C_\nu=1+(e^{\nu-1}-1)^{-1}<\infty$.

The following condition on $\xi=X-\theta$ is assumed throughout.

Condition C1. For some $H_0,M_0>0$ and some positive function $\psi(u)$ monotonically increasing to infinity as $u\to\infty$, and all $I\in\mathcal{I}$, $I\ne\varnothing$, all $M\ge 0$,

(7) $\sup_{\theta\in\mathbb{R}^n}P_\theta\bigl(\rho(P_I\xi)>M\,p(I)\bigr)\le H_0\,e^{-(\psi(M)-M_0)\lambda(|I|)}.$

Notice that Condition C1 is trivially fulfilled for $I=\varnothing$, with zero right-hand side. Let $\mathcal{P}=\mathcal{P}(H_0,M_0,\psi)$ stand for the family of distributions $P_\theta$ satisfying Condition C1.

As $\lambda(s)\ge 1+\log(n)$ for all $s\in[n]$, the right-hand side of (7) is further bounded by $H_0e^{-(\psi(M)-M_0)\lambda(1)}\le H_0n^{-(\psi(M)-M_0)}$, which we will use in the proofs.

Remark 2.3

Condition C1 is mild; for instance, it holds for independent sub-gaussian $\xi_i$'s. Recall one of the equivalent definitions of sub-gaussianity (see Vershynin 2018): a random variable $W$ is called $\sigma$-sub-gaussian for $\sigma>0$ if, for some $c_0>0$, $Ee^{c_0W^2/\sigma^2}\le 2$. In our case, $\sigma=1$. For example, for $\xi_i\overset{\mathrm{ind}}{\sim}N(0,1)$, $c_0=3/8$ (see Section 4). Recall that $\|x\|^2=\sum_{i=1}^n x_i^2$ denotes the usual squared $\ell_2$-norm of $x\in\mathbb{R}^n$. Now, since $P_I\xi$ has at most $|I|$ non-zero entries, by (2) and the Cauchy–Schwarz inequality we have $\rho(P_I\xi)\le c_\tau|P_I\xi|\le c_\tau|I|^{1/2}\|P_I\xi\|$ (this also follows from the following inequality between two $\ell_q$-norms: for $x\in\mathbb{R}^n$ and $0<q_1<q_2$, $\|x\|_{q_2}\le\|x\|_{q_1}\le n^{1/q_1-1/q_2}\|x\|_{q_2}$). Recall also that $\lambda(s)\ge s$, $s\in[0,n]$. Using these and the Markov inequality, we obtain that, for $H_0=1$, $\psi(M)=c_0c_\tau^{-2}M^2$ and $M_0=\log 2$,
$$P\bigl(\rho(P_I\xi)\ge Mp(I)\bigr)\le P\bigl(c_\tau|I|^{1/2}\|P_I\xi\|\ge Mp(I)\bigr)=P\bigl(\|P_I\xi\|\ge Mc_\tau^{-1}[\lambda(|I|)]^{1/2}\bigr)\le P\bigl(c_0\|P_I\xi\|^2\ge c_0c_\tau^{-2}M^2\lambda(|I|)\bigr)\le e^{|I|\log 2-c_0c_\tau^{-2}M^2\lambda(|I|)}\le e^{-(\psi(M)-M_0)\lambda(|I|)}.$$
One can extend the results to the case of so-called sub-exponential errors, with an adjusted complexity function $p(I)$; see Remark 3.3.

Remark 2.4

Condition C1 is clearly satisfied for bounded, arbitrarily dependent $\xi_i$'s. This condition also allows some interesting cases of dependent $\xi_i$'s. By arguing in the same way as in Belitser and Nurushev (2020), we can establish that Condition C1 also holds for $\xi_k$'s that follow an autoregressive model AR(1) with sub-gaussian white noise $\epsilon_k$'s (for appropriately chosen model parameters): $\xi_k=\alpha\xi_{k-1}+\epsilon_k$, $k\in[n]$; $\xi_0=0$, $|\alpha|<1$. In Belitser and Nurushev (2020), normal white noise was used in the above AR(1) model, but this can easily be extended to the sub-gaussian case by adjusting the constants involved.

Remark 2.5

We can extend our setting by also including a parameter $\sigma>0$, by considering $\sigma\xi_i$ instead of just $\xi_i$ (and $(X_i-\theta_i)/\sigma$ instead of $X_i-\theta_i$) in (1) and in Condition C1. In that case, the parameter $\sigma>0$ is assumed to be known and fixed throughout. Together with $n$, it reflects the amount of information in the model, in the sense that $\sigma\to 0$ means an increasing amount of information.

We propose a penalised projection estimator. For $\kappa>0$, define $\hat I=\hat I(X,\kappa)$ to be a minimiser of the criterion $C(I)=C(I,\kappa)=C(I,\kappa,X)$:

(8) $C(I)=\rho(X-P_IX)+\kappa\,p(I)=\rho(P_{I^c}X)+\kappa\,p(I),\qquad \min_{I\in\mathcal{I}}C(I)=C(\hat I).$

Remark 2.6

Since the proposed procedure $\hat I$ for selecting the sparsity pattern is based on the non-symmetric quantile loss function $\rho$, deviations from below and from above are treated differently. But the computation of the estimator $\hat I$ of the sparsity pattern is not difficult. Indeed, it can be reduced to a search over $n$ options: with $b_k=\rho(X_k)\ge 0$, $k\in[n]$,
$$\min_{I\in\mathcal{I}}C(I)=\min_{I\in\mathcal{I}}\bigl\{\rho(P_{I^c}X)+\kappa p(I)\bigr\}=\min_{I\in\mathcal{I}}\Bigl\{\sum_{k\in I^c}\rho(X_k)+\kappa p(I)\Bigr\}=\min_{i\in[n]}\Bigl\{\min_{I\in\mathcal{I}:\,|I|=i}\sum_{k\in I^c}\rho(X_k)+\kappa\, i\log^{1/2}\Bigl(\frac{en}{i}\Bigr)\Bigr\}=\min_{i\in[n]}\Bigl\{\sum_{k=1}^{n-i}b_{(k)}+\kappa\, i\log^{1/2}\Bigl(\frac{en}{i}\Bigr)\Bigr\}=\min_{i\in[n]}\Bigl\{B_i+\kappa\, i\log^{1/2}\Bigl(\frac{en}{i}\Bigr)\Bigr\},$$
where $b_{(1)}\le\dots\le b_{(n)}$ is the ordered sequence of the $b_k$'s and $B_i=\sum_{k=1}^{n-i}b_{(k)}$.
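A minimal Python sketch of this search (our own illustration, assuming a fixed penalty constant kappa) might look as follows:

import numpy as np

def select_pattern(X, tau, kappa):
    """Minimise C(I) = rho(P_{I^c} X) + kappa * p(I) via the order statistics of b_k = rho(X_k)."""
    n = len(X)
    X = np.asarray(X, dtype=float)
    b = X * (tau - (X <= 0.0))                                  # b_k = rho(X_k) >= 0
    order = np.argsort(b)                                       # b_(1) <= ... <= b_(n)
    b_sorted = b[order]
    best_crit, best_size = np.inf, 0
    for i in range(n + 1):                                      # i = |I|
        pen = 0.0 if i == 0 else kappa * i * np.sqrt(np.log(np.e * n / i))
        crit = float(b_sorted[: n - i].sum()) + pen             # I^c keeps the n - i smallest b_k's
        if crit < best_crit:
            best_crit, best_size = crit, i
    I_hat = set(order[n - best_size:].tolist()) if best_size > 0 else set()   # the best_size largest b_k's
    return I_hat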

Now, using the estimated sparsity structure $\hat I$, define the estimator

(9) $\hat\theta=\hat\theta(X,\kappa)=P_{\hat I}X.$

From the triangle inequality (4), it follows that $\rho(x-y)\ge\rho(x)-\rho(y)$. Using this, (3) and the definition (8), we derive that, for any $I\in\mathcal{I}$,

(10) $\kappa\bigl(p(\hat I)-p(I)\bigr)\le\rho(P_{I^c}X)-\rho(P_{\hat I^c}X)=\rho(P_{\hat I\setminus I}X)-\rho(P_{I\setminus\hat I}X)\le\rho(P_{\hat I\setminus I}\theta)+\rho(P_{\hat I\setminus I}\xi)-\rho(P_{I\setminus\hat I}\theta)+\rho(P_{I\setminus\hat I}(-\xi))\le\rho(P_{I^c}\theta)-\rho(P_{\hat I^c}\theta)+\rho(P_{\hat I}\xi)+C_\tau\rho(P_I\xi).$

For $\varkappa>0$, $\theta\in\mathbb{R}^n$ and $I\in\mathcal{I}$, introduce the quantity $r(\theta,I)=r_\varkappa(\theta,I)=\rho(\theta-P_I\theta)+\varkappa\,p(I)=\rho(P_{I^c}\theta)+\varkappa\,p(I)$, which we call the quantile rate of the sparsity structure $I$. The oracle sparsity structure $I_o=I_o(\theta)=I_o(\theta,\varkappa)$ is the one minimising $r_\varkappa(\theta,I)$:

(11) $\min_{I\in\mathcal{I}}r_\varkappa(\theta,I)=r_\varkappa(\theta,I_o)=r_\varkappa(\theta)=r(\theta),$

where the minimal value $r(\theta)$ is called the oracle quantile rate, or just the oracle rate.
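Continuing the sketch above (again our own illustration, reusing the hypothetical helper select_pattern), the estimator (9) and the oracle rate (11) can be computed in the same way, since the minimisation over $I$ has the same order-statistics structure, with $X$ replaced by $\theta$ and $\kappa$ by $\varkappa$:

import numpy as np

def estimate(X, tau, kappa):
    """theta_hat = P_{I_hat} X of (9): keep the selected coordinates of X, zero out the rest."""
    I_hat = select_pattern(X, tau, kappa)
    mask = np.isin(np.arange(len(X)), list(I_hat))
    return np.where(mask, np.asarray(X, dtype=float), 0.0), I_hat

def oracle_rate(theta, tau, varkappa):
    """r(theta) = min_I { rho(P_{I^c} theta) + varkappa * p(I) } of (11), by the same reduction."""
    theta = np.asarray(theta, dtype=float)
    n = len(theta)
    b = np.sort(theta * (tau - (theta <= 0.0)))
    best = np.inf
    for i in range(n + 1):
        pen = 0.0 if i == 0 else varkappa * i * np.sqrt(np.log(np.e * n / i))
        best = min(best, float(b[: n - i].sum()) + pen)
    return best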

3. Main results

In this section, we give the main results. All the constants in the assertions below depend on the fixed constants $M_0,H_0$ and the function $\psi$ appearing in Condition C1, the constant $\kappa$ appearing in the definition (8), and the quantile level $\tau$.

The next result concerns the estimation problem.

Theorem 3.1

Estimation

Let Condition C1 be fulfilled. Then for any $\varkappa>0$ and sufficiently large $\kappa$, there exist positive $M_1,H_1,m_1$ such that

(12) $P_\theta\bigl(\rho(\theta-\hat\theta)\ge M_1r(\theta)\bigr)\le H_1e^{-m_1\lambda(1)}\le H_1n^{-m_1},\qquad \theta\in\mathbb{R}^n,$

where $\hat\theta$ is defined by (9).

Next we address the new problem of uncertainty quantification (UQ) for the parameter $\theta$. A confidence set is defined in terms of the quantile loss function $\rho$ as follows:

(13) $B(\tilde\theta,\tilde r)=\{\theta\in\mathbb{R}^n:\ \rho(\theta-\tilde\theta)\le\tilde r\},$

where the 'centre' $\tilde\theta=\tilde\theta(X):\mathbb{R}^n\to\mathbb{R}^n$ and the 'radius' $\tilde r=\tilde r(X):\mathbb{R}^n\to\mathbb{R}_+=[0,+\infty)$ are measurable functions of the data $X$. The goal is to construct a confidence set $B(\tilde\theta,C\tilde r)$ such that, for any $\alpha_1,\alpha_2\in(0,1]$ and some function $R(\theta)$, $R:\mathbb{R}^n\to\mathbb{R}_+$, there exist $C,c>0$ such that

(14) $\sup_{\theta\in\Theta_{\mathrm{cov}}}P_\theta\bigl(\theta\notin B(\tilde\theta,C\tilde r)\bigr)\le\alpha_1,\qquad \sup_{\theta\in\Theta_{\mathrm{size}}}P_\theta\bigl(\tilde r\ge cR(\theta)\bigr)\le\alpha_2,$

for some $\Theta_{\mathrm{cov}},\Theta_{\mathrm{size}}\subseteq\mathbb{R}^n$. The function $R(\theta)$, called the radial rate, is a benchmark for the effective radius of the confidence set $B(\tilde\theta,C\tilde r)$. The first expression in (14) is called the coverage relation and the second the size relation. It is desirable to find the smallest $R(\theta)$, and the biggest $\Theta_{\mathrm{cov}}$ and $\Theta_{\mathrm{size}}$, such that (14) holds and $R(\theta)\asymp r(\Theta_{\mathrm{size}})$, where $r(\Theta_{\mathrm{size}})$ is the optimal rate in the estimation problem for $\theta\in\Theta_{\mathrm{size}}$. In our local approach, we pursue the even more ambitious goal $R(\theta)\asymp r(\theta)$, where $r(\theta)$ is the oracle rate from the (local) estimation problem.

Typically, the so-called deceptiveness issue arises for the UQ problem, in that a confidence set of optimal size and high coverage can only be constructed for non-deceptive parameters (in particular, $\Theta_{\mathrm{cov}}$ cannot be the whole set $\mathbb{R}^n$). That is, coverage with a radius of optimal size is only possible if a certain deceptive set of parameters is excluded from consideration, which is expressed by imposing some condition on the parameter. For example, the EBR (excessive bias restriction) condition in the case of the $\ell_2$-norm is proposed in Belitser and Nurushev (2019), Belitser (2017) and Belitser and Ghosal (2020). Here we need an EBR-like condition (we keep the same term EBR), but now in terms of the quantile loss function $\rho$.

Condition EBR. We say that a parameter $\theta\in\mathbb{R}^n$ satisfies the excessive bias restriction (EBR) condition with structural parameter $t\ge 0$ if $\theta\in\Theta_{eb}(t)=\Theta_{eb}(t,\varkappa)$, where

(15) $\Theta_{eb}(t)=\Bigl\{\theta\in\mathbb{R}^n:\ \frac{\rho(\theta-P_{I_o}\theta)}{p(I_o)}\le t\Bigr\},$

and the oracle structure $I_o=I_o(\theta)$ is defined by (11) (with the convention $0/0=0$).

The extent of the restriction $\theta\in\Theta_{eb}(t)$ varies over different choices of the constant $t$, becoming more lenient as $t$ increases, eventually covering the entire parameter space. In general, for any $t$, a sequence of $\theta$'s can be found such that $\theta\notin\Theta_{eb}(t)$ as $n$ varies. On the other hand, the set $\Theta_{eb}(0)$ is not empty and consists of those $\theta$'s for which the oracle $I_o$ coincides with the true sparsity structure $I_0$. For a general discussion of EBR, we refer the reader to Belitser and Nurushev (2019). For the sparsity structure in the $\ell_1$-sense, the EBR condition is satisfied if the minimal absolute value of the non-zero coordinates of $\theta$ is larger than a certain lower bound, but this bound also depends on the number of non-zero coordinates of $\theta$. For example, for some sufficiently large $C>0$, $\bigcup_{I\in\mathcal{I}}\bigl\{\theta\in L_I:\ |\theta_i|\ge C\log^{1/2}\bigl(\frac{en}{|I|}\bigr),\ i\in I\bigr\}\subseteq\Theta_{eb}(0)$.

Theorem 3.2

Confidence ball

Let the confidence set $B(\cdot,\cdot)$ be defined by (13), $\hat r=p(\hat I)$, and let $\hat I$ and $\hat\theta$ be given by (8) and (9), respectively. Then for sufficiently large $\kappa,\varkappa$ there exist constants $M_2=M_2(t),H_2,m_2,M_3,H_3,m_3>0$ such that, for any $t\ge 0$,

(16) $\sup_{\theta\in\Theta_{eb}(t)}P_\theta\bigl(\theta\notin B(\hat\theta,M_2(t)\hat r)\bigr)\le H_2n^{-m_2},$

(17) $\sup_{\theta\in\mathbb{R}^n}P_\theta\bigl(\hat r\ge M_3r(\theta)\bigr)\le H_3n^{-m_3}.$

The constants $H_i$, $M_i$, $m_i$, $i=1,2,3$, are all evaluated in the proofs of Theorems 3.1 and 3.2 in the form of several bounds, which depend on the function $\psi$, the constants $H_0,M_0$ from Condition C1, the quantile level $\tau$ and $\varkappa$. There is no 'optimal' choice; in fact, many choices are possible, and an improvement of one constant typically leads to a worsening of another. It should also be noted that the constants (any choice satisfying the bounds in the proofs) are uniform over the whole family $\mathcal{P}$ of distributions satisfying Condition C1, and can therefore be significantly improved for specific error distributions.
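In the notation of the earlier sketches (our own illustration, not from the paper), the data-driven radius $\hat r=p(\hat I)$ and the coverage check of (16) would read:

import numpy as np

def radius(I_hat, n):
    """Data-driven radius r_hat = p(I_hat) = |I_hat| * sqrt(log(e*n/|I_hat|))."""
    i = len(I_hat)
    return 0.0 if i == 0 else i * np.sqrt(np.log(np.e * n / i))

def covers(theta, theta_hat, r_hat, tau, M2):
    """Check whether theta lies in the confidence ball B(theta_hat, M2 * r_hat) of (13)."""
    diff = np.asarray(theta, dtype=float) - np.asarray(theta_hat, dtype=float)
    return float(np.sum(diff * (tau - (diff <= 0.0)))) <= M2 * r_hat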

So far, all the results are formulated in terms of the oracle. For estimation and the construction of a confidence set, this local approach delivers the most general results, as the convergence rate of the proposed estimator is directly linked to the oracle sparsity rather than to the true structure, which may not be sparse itself (but may be very close to a sparse structure). However, if the parameter does have a truly sparse structure, the method attains the quality pertinent to that true sparsity as well.

To illustrate this, consider any $\theta\in\mathbb{R}^n$. The true structure $I_0(\theta)$ and the oracle structure $I_o(\theta)$ in general do not coincide, but they are related by

(18) $r(\theta)=r(\theta,I_o)\le r(\theta,I_0)=\varkappa\,p(I_0)=\varkappa\,|I_0|\log^{1/2}(en/|I_0|).$

The first two relations hold by the definition of the oracle, and the third because $P_{I_0}\theta=\theta$ for the true structure. Since $r(\theta)=\rho(\theta-P_{I_o}\theta)+\varkappa\,p(I_o)$, (18) implies that $p(I_o)\le p(I_0)$ and hence $|I_o|\le|I_0|$. Further, as $|I_0(\theta)|\le s$ for $\theta\in\ell_0[s]$, (18) also yields
$$\sup_{\theta\in\ell_0[s]}r(\theta)\le\varkappa\,s\log^{1/2}(en/s).$$
Besides, all the constants in Theorems 3.1 and 3.2 are uniform over $\theta\in\ell_0[s]$ and $P_\theta\in\mathcal{P}$. The next result follows from these facts and Theorems 3.1 and 3.2.

Corollary 3.1

Under the conditions of Theorems 3.1 and 3.2, with the same choice of the constants,
$$\sup_{\theta\in\ell_0[s]}\sup_{P_\theta\in\mathcal{P}}P_\theta\bigl(\rho(\theta-\hat\theta)\ge M_1\varkappa\,s\log^{1/2}(en/s)\bigr)\le H_1n^{-m_1},\qquad \sup_{\theta\in\ell_0[s]}\sup_{P_\theta\in\mathcal{P}}P_\theta\bigl(\hat r\ge M_3\varkappa\,s\log^{1/2}(en/s)\bigr)\le H_3n^{-m_3},$$
and (16) holds.

We claim that the obtained convergence rate $s\log^{1/2}(en/s)$ in terms of the quantile loss function is optimal over the class $\ell_0[s]$ in the minimax sense. More precisely, the results of Johnstone and Silverman (2004) (relations (17) and (18) therein, with $p=0$ and $q=1$) and (2) imply that for the normal model $P_\theta=\bigotimes_{i\in[n]}N(\theta_i,1)$ there exists an absolute constant $C_1$ such that
$$\inf_{\tilde\theta}\sup_{\theta\in\ell_0[s]}E_\theta|\theta-\tilde\theta|\ge C_1s\log^{1/2}(en/s).$$
This lower bound is not quite what we need to match with our upper bound, because it is formulated in terms of the expectation and the $\ell_1$-loss. However, it is not difficult to establish the probability version of the lower bound: there exist absolute constants $C_1,C_2>0$ such that
$$\inf_{\tilde\theta}\sup_{\theta\in\ell_0[s]}P_\theta\bigl(|\theta-\tilde\theta|\ge C_1s\log^{1/2}(en/s)\bigr)\ge C_2.$$
The relation (2) connects the $\ell_1$-loss with the quantile loss, so that the above relation implies that
$$\inf_{\tilde\theta}\sup_{\theta\in\ell_0[s]}P_\theta\bigl(\rho(\theta-\tilde\theta)\ge C_1(1-c_\tau)s\log^{1/2}(en/s)\bigr)\ge C_2.$$
In this form, the lower bound matches our upper bound given by Corollary 3.1, basically showing that our procedure attains the optimal rate $s\log^{1/2}(en/s)$ for the sparsity class $\ell_0[s]$. This also means that the size of the constructed confidence ball $B(\hat\theta,M_2(t)\hat r)$ is optimal over the sparsity class $\ell_0[s]$ in the minimax sense. However, the unavoidable price for the optimality in the size relation is that the coverage relation holds uniformly over $\Theta_{eb}(t)$, not over $\ell_0[s]$.

Remark 3.1

Interestingly, although there are some 'deceptive' $\theta$ in $\ell_0[s]$ that are not covered by $\Theta_{eb}(t)$, there are also some $\theta$'s in $\Theta_{eb}(t)$ which do not belong to the sparsity class $\ell_0[s]$, but for which the coverage relation holds.

Remark 3.2

We derived the adaptive (the sparsity $s$ is also unknown) minimax results over the traditional sparsity scale $\{\ell_0[s],\ s\in[n]\}$ as a consequence of our local oracle results. The scope of our local results is even broader: the minimax results can be derived over any scale of classes $\{\Theta_s,\ s\in S\}$ with the corresponding minimax rates $R(\Theta_s)$, as long as, for some $c>0$, $\sup_{\theta\in\Theta_s}r(\theta)\le cR(\Theta_s)$, $s\in S$.

Indeed, if the above relation is fulfilled, we immediately obtain all the claims of Corollary 3.1 with $\Theta_s$ instead of $\ell_0[s]$, as consequences of Theorems 3.1 and 3.2. For example, it seems possible to derive the minimax results also for the scale of the $\ell_s$-balls $\ell_s[\eta]=\{\theta\in\mathbb{R}^n:\ \frac1n\sum_{i=1}^n|\theta_i|^s\le\eta^s\}$, $s\in[0,2]$, with $\eta=\eta_n\to 0$ as $n\to\infty$.

Remark 3.3

Interestingly, in relation to Remark 2.3, Condition C1 also holds for independent sub-exponential $\xi_i$'s, but with the adjusted function $p(I)=p_{\exp}(I)=\lambda(|I|)$. Recall one of the equivalent definitions of sub-exponentiality (see Vershynin 2018): a random variable $W$ is called $\sigma$-sub-exponential if $Ee^{c_0|W|/\sigma}\le 2$ for some $c_0>0$; here $\sigma=1$. For example, for the Laplace distribution $\xi_i\overset{\mathrm{ind}}{\sim}f_\lambda(x)=\frac{\lambda}{2}e^{-\lambda|x|}$, we have $Ee^{c_0|\xi_1|}=\frac{\lambda}{\lambda-c_0}$ for $c_0<\lambda$, which is at most $2$ for $c_0\le\frac{\lambda}{2}$; we take $c_0=\frac{\lambda}{2}$. In particular, if $\lambda=1$, then $c_0=\frac12$.

Now, since $P_I\xi$ has at most $|I|$ non-zero entries, by (2) we have $\rho(P_I\xi)\le c_\tau|P_I\xi|=c_\tau\sum_{i\in I}|\xi_i|$. Recall also that $\lambda(s)\ge s$, $s\in[0,n]$. Using these and the Markov inequality, we obtain that
$$P\bigl(\rho(P_I\xi)\ge Mp_{\exp}(I)\bigr)\le P\Bigl(c_\tau\sum_{i\in I}|\xi_i|\ge Mp_{\exp}(I)\Bigr)=P\Bigl(c_0\sum_{i\in I}|\xi_i|\ge Mc_0c_\tau^{-1}p_{\exp}(I)\Bigr)\le e^{|I|\log 2-c_0c_\tau^{-1}Mp_{\exp}(I)}\le e^{-(\psi(M)-M_0)\lambda(|I|)},$$
for $\psi(M)=c_0c_\tau^{-1}M$ and $M_0=\log 2$. Similarly to the setting discussed in Remark 2.4, it is possible to generalise the case of independent sub-exponential errors $\xi_i$'s to the case of errors driven by an AR(1) model with sub-exponential white noise.

Thus, for sub-exponential $\xi_i$'s, Theorems 3.1 and 3.2 hold with the oracle rate $r_{\exp}(\theta)=\min_{I\in\mathcal{I}}\bigl(\rho(P_{I^c}\theta)+\varkappa\,p_{\exp}(I)\bigr)$. Notice a slight deterioration of the oracle rate $r_{\exp}(\theta)$ as compared to the sub-gaussian case, as we now have the complexity $p_{\exp}(I)=|I|\log(en/|I|)$ instead of $p(I)=|I|[\log(en/|I|)]^{1/2}$. This also leads to the correspondingly slightly deteriorated global rate $s\log(en/s)$ (instead of $s\log^{1/2}(en/s)$) in Corollary 3.1. Intuitively, a worse rate in the sub-exponential case is not surprising and should be considered as the price for the 'heavy-tailedness' of the errors $\xi_i$.
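In the earlier sketches, handling the sub-exponential case amounts only to replacing the complexity function (our illustration; the rest of the procedure is unchanged):

import numpy as np

def complexity_subexp(size_I, n):
    """p_exp(I) = lambda(|I|) = |I| * log(e*n/|I|), replacing p(I) = |I| * sqrt(log(e*n/|I|))."""
    return 0.0 if size_I == 0 else size_I * np.log(np.e * n / size_I)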

The question remains whether the resulting rate in this case, $\lambda(s)=s\log(en/s)$, is minimax over the sparsity scale $\{\ell_0[s],\ s\in[n]\}$ for the $\ell_1$-norm and sub-exponential errors. We did not find relevant results in the literature on this, but we conjecture that the minimax rate over the sparsity scale for the $\ell_q$-norm, with errors having density $f_\xi(x)\propto e^{-|x|^\gamma}$, $\gamma\in(0,2]$, is $s[\log(en/s)]^{q/\gamma}$.

4. A simulation study

In this section, we present a small simulation study. The main goal is to demonstrate the deceptiveness phenomenon for the UQ problem, which concerns the coverage relation of Theorem 3.2. Theorem 3.1 and the size relation of Theorem 3.2 will also be demonstrated in passing. Recall that the coverage relation in Theorem 3.2 holds only for the so-called non-deceptive parameters, which are described by the EBR condition: $\theta\in\Theta_{eb}(t)$. Below we provide an example illustrating the failure of the coverage relation for a deceptive parameter $\theta$. Precisely, we construct a sequence of 'deceptive' $\theta^n\in\mathbb{R}^n$, $n\in\mathbb{N}$, such that, for any $M_2$, $P_{\theta^n}(\theta^n\notin B(\hat\theta,M_2\hat r))\to 1$ as $n\to\infty$.

Consider the simplest setting in the model (1): $\xi_i\overset{\mathrm{ind}}{\sim}N(0,1)$, $i=1,\dots,n$, and $\tau=1/2$. In this case, according to Remark 2.3, Condition C1 is satisfied with $H_0=1$, $\psi(M)=c_0c_{0.5}^{-2}M^2$ and $M_0=\log 2$, where $c_{0.5}=\frac12$ and $c_0$ is such that $Ee^{c_0W^2}=(1-2c_0)^{-1/2}=2$ for $W\sim N(0,1)$, leading to the choice $c_0=\frac38$. Consider a parameter $\theta^n=(d_n,1,\dots,1)\in\mathbb{R}^n$ with any $d_n>2\varkappa\,p(\{1\})=2\varkappa\log^{1/2}(en)$, so that the oracle sparsity structure is $I_o=I_o(\theta^n)=\{1\}$ for all $n\in\mathbb{N}$, and hence the ratio in the definition (15) of the EBR condition is $\frac{\rho(\theta^n-P_{I_o}\theta^n)}{p(I_o)}=\frac{n-1}{2\log^{1/2}(en)}=t_n$, $n\in\mathbb{N}$. Since $t_n\to\infty$ as $n\to\infty$, the sequence $\{\theta^n\}$ can be seen as 'deceptive' in that for any $t\ge 0$ there exists $n_0$ such that $\theta^n\notin\Theta_{eb}(t)$ for all $n\ge n_0$. In the simulations below, we took $\kappa=1.1$, $\varkappa=1.4$ and $d_n=10$, which ensures the condition $d_n>2\varkappa\log^{1/2}(en)$ for all considered $n$'s.

Introduce the quantities $\tilde D_n=\frac{\rho(\theta^n-\hat\theta)}{r(\theta^n)}$, $\hat D_n=\frac{\rho(\theta^n-\hat\theta)}{\hat r}$ and $\bar D_n=\frac{\hat r}{r(\theta^n)}$. Now we perform the following simulations: for a chosen $n\in\mathbb{N}$ and the above-described $\theta^n\in\mathbb{R}^n$, generate $B=1000$ data samples, then compute the $B$ values of $\hat\theta^{(k)}$, $\hat r^{(k)}$ and the corresponding $\hat D_n^{(k)}$, $\tilde D_n^{(k)}$ and $\bar D_n^{(k)}$, $k=1,\dots,B$. Since $\{\rho(\theta^n-\hat\theta)\ge M_1r(\theta^n)\}=\{\tilde D_n\ge M_1\}$, $\{\theta^n\notin B(\hat\theta,M_2\hat r)\}=\{\hat D_n>M_2\}$ and $\{\hat r\ge M_3r(\theta^n)\}=\{\bar D_n\ge M_3\}$, we can approximate the probabilities of the events from Theorems 3.1 and 3.2 as follows:
$$P_{\theta^n}\bigl(\rho(\theta^n-\hat\theta)\ge M_1r(\theta^n)\bigr)\approx\frac1B\sum_{k=1}^B\mathbb{1}\{\tilde D_n^{(k)}\ge M_1\},\quad P_{\theta^n}\bigl(\theta^n\notin B(\hat\theta,M_2\hat r)\bigr)\approx\frac1B\sum_{k=1}^B\mathbb{1}\{\hat D_n^{(k)}>M_2\},\quad P_{\theta^n}\bigl(\hat r\ge M_3r(\theta^n)\bigr)\approx\frac1B\sum_{k=1}^B\mathbb{1}\{\bar D_n^{(k)}\ge M_3\}.$$
In Figure 1, the boxplots of the $\tilde D_n^{(k)}$'s, $\hat D_n^{(k)}$'s and $\bar D_n^{(k)}$'s are plotted for several increasing $n=50,100,200,400,800,1600,3200$. We see that the boxplots of the $\tilde D_n^{(k)}$'s (the plot on the left) stabilise around 1; this is because of the special choice of $\theta^n$, for which $\rho(\theta^n-\hat\theta)$ is asymptotically equivalent to the oracle rate $r(\theta^n)$. Thus, the first probability in the above display goes to zero for any $M_1>1$, which shows that the claim of Theorem 3.1 holds also for this deceptive $\{\theta^n\}$, as it should, because the result of Theorem 3.1 is uniform over $\theta\in\mathbb{R}^n$. The size relation of Theorem 3.2 holds as well (as it also should): the boxplots of the $\bar D_n^{(k)}$'s (the plot on the right) even move towards zero, showing that the third probability in the above display converges to zero for any $M_3>0$.
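A scaled-down version of this experiment can be reproduced along the following lines (a Python sketch of our own, reusing the hypothetical helpers quantile_loss, oracle_rate, estimate and radius from the earlier sketches; the constants are as above, and medians are printed instead of boxplots):

import numpy as np

rng = np.random.default_rng(0)
tau, kappa, varkappa, d, B = 0.5, 1.1, 1.4, 10.0, 1000

for n in (50, 100, 200, 400):
    theta = np.r_[d, np.ones(n - 1)]                    # deceptive parameter theta^n = (d_n, 1, ..., 1)
    r_theta = oracle_rate(theta, tau, varkappa)         # oracle rate r(theta^n)
    D_tilde, D_hat, D_bar = [], [], []
    for _ in range(B):
        X = theta + rng.standard_normal(n)              # X_i = theta_i + xi_i, xi_i ~ N(0, 1)
        theta_hat, I_hat = estimate(X, tau, kappa)
        r_hat = radius(I_hat, n)
        loss = quantile_loss(theta - theta_hat, tau)
        D_tilde.append(loss / r_theta)
        D_hat.append(loss / r_hat if r_hat > 0 else np.inf)
        D_bar.append(r_hat / r_theta)
    print(n, np.median(D_tilde), np.median(D_hat), np.median(D_bar))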

Figure 1. Sub-gaussian case: boxplots of the $\tilde D_n^{(k)}$'s, $\hat D_n^{(k)}$'s and $\bar D_n^{(k)}$'s for increasing $n$ and the deceptive $\{\theta^n\}$.


The most interesting plot is the middle one in Figure 1: it shows that the boxplots of the $\hat D_n^{(k)}$'s move up as $n$ increases, ensuring that the fraction of data points for which $\hat D_n^{(k)}>M_2$ goes to 1 as $n\to\infty$ for any constant $M_2$. Hence, $P_{\theta^n}(\theta^n\notin B(\hat\theta,M_2\hat r))\to 1$ as $n\to\infty$ for the chosen deceptive parameter sequence $\{\theta^n\}$, which demonstrates the deceptiveness phenomenon.

Finally, we perform simulations for the sub-exponential case discussed in Remark 3.3 and a non-deceptive parameter $\theta^{n\prime}=(d_n,\dots,d_n,0,\dots,0)\in\mathbb{R}^n$. Precisely, we consider Laplace errors $\xi_i\overset{\mathrm{ind}}{\sim}\mathrm{Laplace}(1,0)$, $i=1,\dots,n$ (i.e. with density $\frac12e^{-|x|}$), and we set the first 20 entries of the parameter $\theta^{n\prime}$ equal to $d_n$ and the remaining coordinates to zero. This also means that we used the complexity $p_{\exp}(I)$ instead of $p(I)$. All the other parameters of this simulation experiment remain the same as before.

Figure 2 shows that the claims of Theorems 3.1 and 3.2 hold for the non-deceptive $\{\theta^{n\prime}\}$. We now see a downward trend in the boxplots of the $\hat D_n^{(k)}$'s as $n$ increases, illustrating the good behaviour of the confidence set for non-deceptive parameters also in the sub-exponential case.

Figure 2. Sub-exponential case: boxplots of the $\tilde D_n^{(k)}$'s, $\hat D_n^{(k)}$'s and $\bar D_n^{(k)}$'s for increasing $n$ and the non-deceptive $\{\theta^{n\prime}\}$.


5. Proofs

In this section, we present the proofs of the theorems.

Proof of Theorem 3.1.

First we introduce some notation. Define the events
$$\mathcal{E}_1=\{\rho(\theta-\hat\theta)\ge M_1r(\theta)\},\qquad \mathcal{E}_2=\{r(\theta,\hat I)\le A_1r(\theta)\},\qquad \mathcal{E}_3=\{\rho(P_{\hat I}\xi)\le A_2p(\hat I)\},$$
where the constants $M_1,A_1,A_2>0$ are to be chosen later. We evaluate the probability of interest $P_\theta(\mathcal{E}_1)=P_\theta(\rho(\theta-\hat\theta)\ge M_1r(\theta))$ as follows:

(19) $P_\theta(\mathcal{E}_1)=P_\theta(\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3)+P_\theta(\mathcal{E}_1\cap\mathcal{E}_3^c)+P_\theta(\mathcal{E}_1\cap\mathcal{E}_2^c\cap\mathcal{E}_3)=T_1+T_2+T_3.$

We bound these three probabilities separately.

First we bound $T_1$. Using the properties (3) and (4) of the quantile loss function $\rho$, we derive that the event $\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$ implies that
$$M_1r(\theta)\le\rho(\theta-\hat\theta)=\rho(\theta-P_{\hat I}X)\le\rho(\theta-P_{\hat I}\theta)+\rho(-P_{\hat I}\xi)\le\rho(\theta-P_{\hat I}\theta)+A_2C_\tau p(\hat I)\le\max\Bigl\{1,\frac{A_2C_\tau}{\varkappa}\Bigr\}r(\theta,\hat I)\le\max\Bigl\{1,\frac{A_2C_\tau}{\varkappa}\Bigr\}A_1r(\theta).$$
Hence, if $M_1>A_1\max\{1,\frac{A_2C_\tau}{\varkappa}\}$, then

(20) $T_1=P_\theta(\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3)\le P_\theta\Bigl(M_1r(\theta)\le\max\Bigl\{1,\frac{A_2C_\tau}{\varkappa}\Bigr\}A_1r(\theta)\Bigr)=0.$

To bound $T_2$, write
$$T_2=P_\theta(\mathcal{E}_1\cap\mathcal{E}_3^c)\le P_\theta(\mathcal{E}_3^c)\le\sum_{I\in\mathcal{I}}P\bigl(\rho(P_I\xi)>A_2p(I)\bigr).$$
If $A_2>0$ is such that $\psi(A_2)>M_0+1$, then using Condition C1 (7), we obtain that
$$P\bigl(\rho(P_I\xi)>A_2p(I)\bigr)\le H_0e^{-A_3\lambda(|I|)},\qquad I\in\mathcal{I},\ I\ne\varnothing,$$
where we set $A_3=\psi(A_2)-M_0>1$. Recall that $\lambda(s)$ is increasing for $s\in(0,n]$. This, (6) and the two previous displays entail that, since $A_3>1$ (so that $A_4=\frac{A_3-1}{2}>0$),

(21) $T_2\le P_\theta(\mathcal{E}_3^c)\le\sum_{I\in\mathcal{I}}P\bigl(\rho(P_I\xi)>A_2p(I)\bigr)\le H_0\sum_{I\in\mathcal{I}:\,|I|\ge 1}e^{-A_3\lambda(|I|)}\le H_0e^{-A_4\lambda(1)}\sum_{I\in\mathcal{I}:\,|I|\ge 1}e^{-(1+A_4)\lambda(|I|)}\le H_1'e^{-A_4\lambda(1)},$

with $H_1'=H_0C_{1+A_4}$ by (6). Finally, we bound $T_3=P_\theta(\mathcal{E}_1\cap\mathcal{E}_2^c\cap\mathcal{E}_3)$. Using (10) with $I=I_o$ and $\kappa>A_2$, we obtain that, under $\mathcal{E}_2^c\cap\mathcal{E}_3$,
$$C_\tau\rho(P_{I_o}\xi)\ge\rho(P_{\hat I^c}\theta)+\kappa p(\hat I)-\rho(P_{I_o^c}\theta)-\kappa p(I_o)-\rho(P_{\hat I}\xi)\ge\rho(P_{\hat I^c}\theta)+\kappa p(\hat I)-\rho(P_{I_o^c}\theta)-\kappa p(I_o)-A_2p(\hat I)\ge\min\Bigl\{\frac{\kappa-A_2}{\varkappa},1\Bigr\}r(\theta,\hat I)-\max\Bigl\{\frac{\kappa}{\varkappa},1\Bigr\}r(\theta)\ge\Bigl(\min\Bigl\{\frac{\kappa-A_2}{\varkappa},1\Bigr\}A_1-\max\Bigl\{\frac{\kappa}{\varkappa},1\Bigr\}\Bigr)r(\theta)=A_5r(\theta)\ge A_5\varkappa p(I_o),$$
as long as $A_5=\min\{\frac{\kappa-A_2}{\varkappa},1\}A_1-\max\{\frac{\kappa}{\varkappa},1\}>0$. We conclude that the event $\mathcal{E}_2^c\cap\mathcal{E}_3$ implies the event $\{\rho(P_{I_o}\xi)\ge\frac{A_5\varkappa}{C_\tau}p(I_o)\}$, so that, by Condition C1 (7),

(22) $T_3\le P_\theta(\mathcal{E}_2^c\cap\mathcal{E}_3)\le P_\theta\Bigl(\rho(P_{I_o}\xi)\ge\frac{A_5\varkappa}{C_\tau}p(I_o)\Bigr)\le H_0e^{-A_6\lambda(1)},$

where $A_6=\psi(\frac{A_5\varkappa}{C_\tau})-M_0>0$ if $A_5>0$ is chosen so large that $\psi(\frac{A_5\varkappa}{C_\tau})>M_0$.

To summarise the choices of the constants: we need to take $\kappa,M_1>0$ (in the claim of the theorem) and constants $A_1,A_2>0$ (in the proof of the theorem) such that $M_1>A_1\max\{1,\frac{A_2C_\tau}{\varkappa}\}$, $\psi(A_2)>M_0+1$ (so that $A_3=\psi(A_2)-M_0>1$), $A_5=\min\{\frac{\kappa-A_2}{\varkappa},1\}A_1-\max\{\frac{\kappa}{\varkappa},1\}>0$ (for example, we can fix $\kappa=A_2+1$) and $\psi(\frac{A_5\varkappa}{C_\tau})>M_0$, which is always possible since the function $\psi(u)$ increases monotonically to infinity as $u\to\infty$. Combining (19)–(22), we obtain the claim of the theorem with the chosen $\kappa$, $M_1$, $H_1=H_1'+H_0$ and $m_1=\min\{A_4,A_6\}$.

Proof of Theorem 3.2.

For some fixed $\delta\in(0,1)$ (for example, take $\delta=1/2$), introduce the event $\mathcal{E}_4=\{p(\hat I)\le\delta p(I_o)\}$, where $I_o=I_o(\theta)$ is defined by (11).

First we evaluate $P_\theta(p(\hat I)\le\delta p(I_o))=P_\theta(\mathcal{E}_4)$. We have
$$p(\hat I\cup I_o)=|\hat I\cup I_o|\log^{1/2}\Bigl(\frac{en}{|\hat I\cup I_o|}\Bigr)\le|\hat I|\log^{1/2}\Bigl(\frac{en}{|\hat I\cup I_o|}\Bigr)+|I_o|\log^{1/2}\Bigl(\frac{en}{|\hat I\cup I_o|}\Bigr)\le|\hat I|\log^{1/2}\Bigl(\frac{en}{|\hat I|}\Bigr)+|I_o|\log^{1/2}\Bigl(\frac{en}{|I_o|}\Bigr)=p(\hat I)+p(I_o).$$
By using the above relation, we obtain that, under the event $\mathcal{E}_4$, $p(\hat I\cup I_o)\le p(\hat I)+p(I_o)\le(1+\delta)p(I_o)$. Hence, under $\mathcal{E}_4$,
$$\frac{1}{1+\delta}p(\hat I\cup I_o)\le p(I_o)\le p(\hat I\cup I_o).$$
The last relation, (5) with $I_1=(\hat I\cup I_o)^c\subseteq I_o^c=I_2$ and the definition (11) of the oracle $I_o$ imply that, under $\mathcal{E}_4$,
$$\rho(P_{\hat I^c}\theta)-\rho(P_{(\hat I\cup I_o)^c}\theta)\ge\rho(P_{\hat I^c}\theta)-\rho(P_{I_o^c}\theta)\ge\varkappa\bigl(p(I_o)-p(\hat I)\bigr)\ge\varkappa(1-\delta)p(I_o)\ge\varkappa\frac{1-\delta}{1+\delta}p(\hat I\cup I_o).$$
Recall the event $\mathcal{E}_3=\{\rho(P_{\hat I}\xi)\le A_2p(\hat I)\}$ from the proof of the previous theorem and let $\kappa$ be sufficiently large to satisfy $\kappa>A_2$. Further, choose $\varkappa$ sufficiently large to satisfy $\varkappa\frac{1-\delta}{1+\delta}-\kappa>C_\tau A_7$ with $A_7>0$ such that $A_8=\psi(A_7)-M_0>0$. Then, under $\mathcal{E}_3\cap\mathcal{E}_4$, from (10) with $I=\hat I\cup I_o$, it follows that

(23) $C_\tau\rho(P_{I_o}\xi)\ge\kappa\bigl(p(\hat I)-p(I)\bigr)-A_2p(\hat I)+\rho(P_{\hat I^c}\theta)-\rho(P_{I^c}\theta)\ge\Bigl(\varkappa\frac{1-\delta}{1+\delta}-\kappa\Bigr)p(\hat I\cup I_o)\ge\Bigl(\varkappa\frac{1-\delta}{1+\delta}-\kappa\Bigr)p(I_o)>C_\tau A_7p(I_o).$

Using the last display, (21), Condition C1 (7) and $A_8=\psi(A_7)-M_0>0$, we bound

(24) $P_\theta(p(\hat I)\le\delta p(I_o))=P_\theta(\mathcal{E}_4)\le P_\theta(\mathcal{E}_3^c)+P_\theta(\mathcal{E}_3\cap\mathcal{E}_4)\le H_1'e^{-A_4\lambda(1)}+P_\theta\bigl(\rho(P_{I_o}\xi)>A_7p(I_o)\bigr)\le H_1'e^{-A_4\lambda(1)}+H_0e^{-A_8\lambda(1)}.$

Now we establish the coverage property. The constants $M_1$, $H_1$ and $m_1$ are defined in Theorem 3.1. Take $M_2=\frac{M_1(t+\varkappa)}{\delta}$, where the fixed $\delta\in(0,1)$ is from the definition of the event $\mathcal{E}_4$. If $\theta\in\Theta_{eb}(t)$, then, in view of (11), $r(\theta)=\rho(P_{I_o^c}\theta)+\varkappa p(I_o)\le(t+\varkappa)p(I_o)$. So $p(I_o)\ge(t+\varkappa)^{-1}r(\theta)$ for all $\theta\in\Theta_{eb}(t)$. Combining this with Theorem 3.1 and (24) yields that, uniformly in $\theta\in\Theta_{eb}(t)$,
$$P_\theta\bigl(\theta\notin B(\hat\theta,M_2\hat r)\bigr)\le P_\theta\bigl(\rho(\theta-\hat\theta)>M_2\hat r,\ \hat r\ge\delta p(I_o)\bigr)+P_\theta\bigl(\hat r<\delta p(I_o)\bigr)\le P_\theta\bigl(\rho(\theta-\hat\theta)>M_1(t+\varkappa)p(I_o)\bigr)+P_\theta\bigl(p(\hat I)<\delta p(I_o)\bigr)\le P_\theta\bigl(\rho(\theta-\hat\theta)>M_1r(\theta)\bigr)+P_\theta\bigl(p(\hat I)<\delta p(I_o)\bigr)\le H_1e^{-m_1\lambda(1)}+H_1'e^{-A_4\lambda(1)}+H_0e^{-A_8\lambda(1)}.$$
The coverage relation follows.

Let us show the size property. Recall again the event $\mathcal{E}_3=\{\rho(P_{\hat I}\xi)\le A_2p(\hat I)\}$ from the proof of the previous theorem. By using (23) with $I=I_o$, we derive that the event $\{p(\hat I)\ge M_3r(\theta)\}\cap\mathcal{E}_3$ implies the event $\{C_\tau\rho(P_{I_o}\xi)\ge(\kappa-A_2)p(\hat I)-r(\theta)\ge A_9p(\hat I)\ge A_9M_3r(\theta)\ge A_9M_3\varkappa p(I_o)\}$ with $A_9=\kappa-A_2-M_3^{-1}>0$, where $A_2$ is defined in the proof of Theorem 3.1. We thus have that, for any $\theta\in\mathbb{R}^n$,
$$P_\theta\bigl(\hat r\ge M_3r(\theta)\bigr)\le P_\theta(\mathcal{E}_3^c)+P_\theta\bigl(\hat r\ge M_3r(\theta),\ \mathcal{E}_3\bigr)=P_\theta(\mathcal{E}_3^c)+P_\theta\bigl(p(\hat I)\ge M_3r(\theta),\ \mathcal{E}_3\bigr)\le P_\theta(\mathcal{E}_3^c)+P_\theta\bigl(p(\hat I)\ge M_3r(\theta),\ C_\tau\rho(P_{I_o}\xi)\ge A_9p(\hat I)\bigr)\le P_\theta(\mathcal{E}_3^c)+P_\theta\bigl(C_\tau\rho(P_{I_o}\xi)\ge A_9p(\hat I)\ge A_9M_3\varkappa p(I_o)\bigr)\le H_1'e^{-A_4\lambda(1)}+H_0e^{-A_{10}\lambda(1)},$$
where the last inequality in the above display is obtained by (21) and Condition C1 (7), and $M_3$ is chosen so large that $A_{10}=\psi(A_9M_3\varkappa/C_\tau)-M_0>0$. The size relation follows.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Abramovich, F., Benjamini, Y., Donoho, D.L., and Johnstone, I.M. (2006), ‘Adapting to Unknown Sparsity by Controlling the False Discovery Rate’, The Annals of Statistics, 34, 584–653.
  • Abrevaya, J. (2002), ‘The Effects of Demographics and Maternal Behavior on the Distribution of Birth Outcomes’, in Economic Applications of Quantile Regression, eds. B. Fitzenberger, R. Koenker, and J. A. F. Machado, Berlin Heidelberg: Springer-Verlag, pp. 247–257.
  • Babenko, A., and Belitser, E. (2010), ‘Oracle Convergence Rate of Posterior Under Projection Prior and Bayesian Model Selection’, Mathematical Methods of Statistics, 19, 219–245.
  • Baraud, Y. (2004), ‘Confidence Balls in Gaussian Regression’, The Annals of Statistics, 32, 528–551.
  • Belitser, E. (2017), ‘On Coverage and Local Radial Rates of Credible Sets’, The Annals of Statistics, 45, 1124–1151.
  • Belitser, E., and Ghosal, S. (2020), ‘Empirical Bayes Oracle Uncertainty Quantification for Regression’, The Annals of Statistics, 31, 536–559.
  • Belitser, E., and Nurushev, N. (2019), General Framework for Projection Structures. ArXiv: 1904.01003.
  • Belitser, E., and Nurushev, N. (2020), ‘Needles and Straw in a Haystack: Robust Empirical Bayes Confidence for Possibly Sparse Sequences’, Bernoulli, 26, 191–225.
  • Bellec, P.C. (2018), ‘Sharp Oracle Inequalities for Least Squares Estimators in Shape Restricted Regression’, The Annals of Statistics, 46, 745–780.
  • Belloni, A., and Chernozhukov, V. (2011), ‘L1-Penalized Quantile Regression in High-Dimensional Sparse Models’, The Annals of Statistics, 39, 82–130.
  • Belloni, A., Chernozhukov, V., and Kato, K. (2019), ‘Valid Post-Selection Inference in High-Dimensional Approximately Sparse Quantile Regression Models’, Journal of the American Statistical Association, 114, 749–758.
  • Belloni, A., Chernozhukov, V., and Wang, L. (2011), ‘Square-root Lasso: Pivotal Recovery of Sparse Signals via Conic Programming’, Biometrika, 98, 791–806.
  • Benjamini, Y., and Hochberg, Y. (1995), ‘Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing’, Journal of the Royal Statistical Society: Series B, 57, 289–300.
  • Birgé, L., and Massart, P. (2001), ‘Gaussian Model Selection’, Journal of the European Mathematical Society, 3, 203–268.
  • Butucea, C., Ndaoud, M., Stepanova, N., and Tsybakov, A. (2018), ‘Variable Selection with Hamming Loss’, The Annals of Statistics, 46, 1837–1875.
  • Castillo, I., and van der Vaart, A. (2012), ‘Needles and Straw in a Haystack: Posterior Concentration for Possibly Sparse Sequences’, The Annals of Statistics, 40, 2069–2101.
  • Donoho, D.L., and Johnstone, I.M. (1994), ‘Minimax Risk Over ℓp-Balls for ℓq-Error’, Probability Theory and Related Fields, 99, 277–303.
  • Donoho, D.L., Johnstone, I.M., Hoch, J.C., and Stern, A.S. (1992), ‘Maximum Entropy and the Nearly Black Object (with Discussion)’, Journal of the Royal Statistical Society: Series B, 54, 41–81.
  • Efron, B. (2008), ‘Microarrays, Empirical Bayes and the Two-Groups Model’, Statistical Science, 23, 1–22.
  • Ciuperca, G. (2018), ‘Test by Adaptive Lasso Quantile Method for Real-Time Detection of a Change-Point’, Metrika, 81, 689–720.
  • Hulmán, A., Witte, D.R., Kerényi, Z., Madarász, E., Tánczer, T., Bosnyák, Z., Szabó, E., Ferencz, V., Péterfalvi, A., Tabák, A.G., and Nyári, T.A. (2015), ‘Heterogeneous Effect of Gestational Weight Gain on Birth Weight: Quantile Regression Analysis From a Population-Based Screening’, Annals of Epidemiology, 25, 133–137.
  • Jiang, L., Bondell, H.D., and Wang, H.J. (2014), ‘Interquantile Shrinkage and Variable Selection in Quantile Regression’, Computational Statistics & Data Analysis, 69, 208–219.
  • Jiang, L., Wang, H.J., and Bondell, H.D. (2013), ‘Interquantile Shrinkage in Regression Models’, Journal of Computational and Graphical Statistics, 22, 970–986.
  • Johnstone, I.M. (2017), Gaussian Estimation: Sequence and Wavelet Models, Book Draft.
  • Johnstone, I.M., and Silverman, B.W. (2004), ‘Needles and Straw in Haystacks: Empirical Bayes Estimates of Possibly Sparse Sequences’, The Annals of Statistics, 32, 1594–1649.
  • Koenker, R. (2005), Quantile Regression, Cambridge: Cambridge University Press.
  • Li, K.-C. (1989), ‘Honest Confidence Regions for Nonparametric Regression’, The Annals of Statistics, 17, 1001–1008.
  • Martin, R., and Walker, S.G. (2014), ‘Asymptotically Minimax Empirical Bayes Estimation of a Sparse Normal Mean Vector’, Electronic Journal of Statistics, 8, 2188–2206.
  • Santosa, F., and Symes, W.W. (1986), ‘Linear Inversion of Band-Limited Reflection Seismograms’, SIAM Journal on Scientific and Statistical Computing, 7, 1307–1330.
  • Szabó, B.T., van der Vaart, A.W., and van Zanten, J.H. (2015), ‘Frequentist Coverage of Adaptive Nonparametric Bayesian Credible Sets’, The Annals of Statistics, 43, 1391–1428.
  • van der Pas, S.L., Szabó, B.T., and van der Vaart, A.W. (2017), ‘Uncertainty Quantification for the Horseshoe (with Discussion)’, Bayesian Analysis, 12, 1221–1274.
  • Vershynin, R. (2018), High-Dimensional Probability. An Introduction with Applications in Data Science, Cambridge: Cambridge University Press.
  • Zou, H., and Yuan, M. (2008), ‘Composite Quantile Regression and the Oracle Model Selection Theory’, The Annals of Statistics, 36, 1108–1126.