Theory and Methods

On the Length of Post-Model-Selection Confidence Intervals Conditional on Polyhedral Constraints

Pages 845-857 | Received 02 Feb 2018, Accepted 06 Feb 2020, Published online: 22 Apr 2020

Abstract

Valid inference after model selection is currently a very active area of research. The polyhedral method, introduced in an article by Lee et al., allows for valid inference after model selection if the model selection event can be described by polyhedral constraints. In that reference, the method is exemplified by constructing two valid confidence intervals when the Lasso estimator is used to select a model. We here study the length of these intervals. For one of these confidence intervals, which is easier to compute, we find that its expected length is always infinite. For the other of these confidence intervals, whose computation is more demanding, we give a necessary and sufficient condition for its expected length to be infinite. In simulations, we find that this sufficient condition is typically satisfied, unless the selected model includes almost all or almost none of the available regressors. For the distribution of confidence interval length, we find that the κ-quantiles behave like 1/(1−κ) for κ close to 1. Our results can also be used to analyze other confidence intervals that are based on the polyhedral method.

1 Introduction

Lee et al. (2016) recently introduced a new technique for valid inference after model selection, the so-called polyhedral method. Using this method, and using the Lasso for model selection in linear regression, Lee et al. (2016) derived two new confidence sets that are valid conditional on the outcome of the model selection step. More precisely, let m̂ denote the model containing those regressors that correspond to nonzero coefficients of the Lasso estimator, and let ŝ denote the sign-vector of those nonzero Lasso coefficients. Then Lee et al. (2016) constructed confidence intervals [Lm̂,ŝ,Um̂,ŝ] and [Lm̂,Um̂] whose coverage probability is 1−α, conditional on the events {m̂=m, ŝ=s} and {m̂=m}, respectively (provided that the probability of the conditioning event is positive). The computational effort in constructing these intervals is considerably lighter for [Lm̂,ŝ,Um̂,ŝ]. In simulations, Lee et al. (2016) noted that this latter interval can be quite long in some cases; cf. Figure 10 in that reference. We here analyze the lengths of these intervals through their (conditional) means and through their quantiles.

We focus here on the original proposal of Lee et al. (2016) for the sake of simplicity and ease of exposition. Nevertheless, our findings also carry over to several recent developments that rely on the polyhedral method and that are mentioned in Section 1.2; see Remark 1(i) and Remark 3.

1.1 Overview of Findings

Throughout, we use the same setting and assumptions as Lee et al. (2016). In particular, we assume that the response vector is distributed as N(μ,σ2In) with unknown mean μ∈Rn and known variance σ2>0 (our results carry over to the unknown-variance case; see Section 3.3), and that the nonstochastic regressor matrix has columns in general position. Write Pμ,σ2 and Eμ,σ2 for the probability measure and the expectation operator, respectively, corresponding to N(μ,σ2In).

For the interval [Lm̂,ŝ,Um̂,ŝ], we find the following: Fix a nonempty model m, a sign-vector s, as well as μ∈Rn and σ2>0. If Pμ,σ2(m̂=m, ŝ=s)>0, then
Eμ,σ2[Um̂,ŝ−Lm̂,ŝ | m̂=m, ŝ=s] = ∞;(1)
see Proposition 2 and the attending discussion. Obviously, this statement continues to hold if the event {m̂=m, ŝ=s} is replaced by the larger event {m̂=m} throughout. And this statement continues to hold if the condition Pμ,σ2(m̂=m, ŝ=s)>0 is dropped and the conditional expectation in (1) is replaced by the unconditional one.

For the interval [Lm̂,Um̂], we derive a necessary and sufficient condition for its expected length to be infinite, conditional on the event m̂=m; cf. Proposition 3. That condition is never satisfied if the model m is empty or includes only one regressor; it is also typically not satisfied if m includes all available regressors (see Corollary 1). The necessary and sufficient condition depends on the regressor matrix, on the model m, and also on a linear contrast that defines the quantity of interest, and is daunting to verify in all but the most basic examples. We also provide a sufficient condition for infinite expected length that is easy to verify. In simulations, we find that this sufficient condition for infinite expected length is typically satisfied, except for two somewhat extreme cases: (a) if the Lasso penalty λ is very large (so that almost all regressors are excluded); (b) if the number of available regressors is not larger than the sample size and λ is very small (so that almost no regressor is excluded). See Section 4 for more detail.

Of course, a confidence interval with infinite expected length can still be quite short with high probability. In our theoretical analysis and in our simulations, we find that the κ-quantiles of Um̂,ŝ−Lm̂,ŝ and Um̂−Lm̂ behave like the κ-quantiles of 1/U with U∼U(0,1), that is, like 1/(1−κ), for κ close to 1 if the conditional expected length of these intervals is infinite; cf. Proposition 4 and the attending discussions.

The methods developed in this article can also be used if the Lasso, as the model selector, is replaced by any other procedure that relies on the polyhedral method; see Remark 1(i) and Remark 3. In particular, we see that confidence intervals based on the polyhedral method in Gaussian regression can have infinite expected length. Our findings suggest that the expected length of confidence intervals based on the polyhedral method should be closely scrutinized, in Gaussian regression but also in non-Gaussian settings and other variations of the polyhedral method.

“Length” is arguably only one of several possible criteria for judging the “quality” of valid confidence intervals, albeit one of practical interest. Our focus on confidence interval length is justified by our findings.

The rest of the article is organized as follows: We conclude this section by discussing a number of related results that put our findings in context. Section 2 describes the confidence intervals of Lee et al. (2016) in detail and introduces some notation. Section 3 contains our core results, Propositions 1–4, which entail our main findings, as well as a discussion of the unknown-variance case. The simulation studies mentioned earlier are given in Section 4. Finally, in Section 5, we discuss some implications of our findings. In particular, we argue that the computational simplicity of the polyhedral method comes at a price in terms of interval length, and that computationally more involved methods can provide a remedy. The appendix contains the proofs and some auxiliary lemmas.

1.2 Context and Related Results

There are currently several exciting ongoing developments based on the polyhedral method, not least because it proved to be applicable to more complicated settings, and there are several generalizations of this framework (see, among others, Fithian et al. 2015; Gross, Taylor, and Tibshirani 2015; Reid, Taylor, and Tibshirani 2015; Tian and Taylor 2016, 2017; Tian et al. 2016; Tibshirani et al. 2016; Tian, Loftus, and Taylor 2017; Markovic, Xia, and Taylor 2018; Panigrahi, Zhu, and Sabatti 2018; Taylor and Tibshirani 2018; Panigrahi and Taylor 2019). Certain optimality results of the method of Lee et al. (2016) are given in Fithian, Sun, and Taylor (2017). Using a different approach, Berk et al. (2013) proposed the so-called PoSI-intervals which are unconditionally valid. A benefit of the PoSI-intervals is that they are valid after selection with any possible model selector, instead of a particular one like the Lasso; however, as a consequence, the PoSI-intervals are typically very conservative (i.e., the actual coverage probability is above the nominal level). Nonetheless, Bachoc, Leeb, and Pötscher (2019) showed in a Monte Carlo simulation that, in certain scenarios, the PoSI-intervals can be shorter than the intervals of Lee et al. (2016). The results of the present article are based on the first author’s master’s thesis.

It is important to note that all confidence sets discussed so far are nonstandard, in the sense that the parameter to be covered is not the true parameter in an underlying correct model (or components thereof), but instead is a model-dependent quantity of interest. (See Section 2 for details and the references in the preceding paragraph for more extensive discussions.) An advantage of this nonstandard approach is that it does not rely on the assumption that any of the candidate models is correct. Valid inference for an underlying true parameter is a more challenging task, as demonstrated by the impossibility results in Leeb and Pötscher (2006a, 2006b, 2008). There are several proposals of valid confidence intervals after model selection (in the sense that the actual coverage probability of the true parameter is at or above the nominal level) but these are rather large compared to the standard confidence intervals from the full model (supposing that one can fit the full model); see Pötscher (2009), Pötscher and Schneider (2010), and Schneider (2016). In fact, Leeb and Kabaila (2017) showed that the usual confidence interval obtained by fitting the full model is admissible also in the unknown variance case; therefore, one cannot obtain uniformly smaller valid confidence sets for a component of the true parameter by any other method.

2 Assumptions and Confidence Intervals

Let Y denote the N(μ,σ2In)-distributed response vector, n≥1, where μ∈Rn is unknown and σ2>0 is known. Let X=(x1,…,xp), p≥1, with xi∈Rn for each i=1,…,p, be the nonstochastic n × p regressor matrix. We assume that the columns of X are in general position (this mild assumption is further discussed in the following paragraph). The full model {1,…,p} is denoted by mF. All subsets of the full model are collected in M, that is, M={m: m⊆mF}. The cardinality of a model m is denoted by |m|. For any m={i1,…,ik}∈M with i1<⋯<ik, we set Xm=(xi1,…,xik). Analogously, for any vector v∈Rp, we set vm=(vi1,…,vik). If m is the empty model, then Xm is to be interpreted as the zero vector in Rn and vm as 0.

The Lasso estimator, denoted by β̂(y), is a minimizer of the least squares problem with an additional penalty on the absolute size of the regression coefficients (Frank and Friedman 1993; Tibshirani 1996):
minβ∈Rp (1/2)‖y−Xβ‖₂² + λ‖β‖₁,  y∈Rn, λ>0.

The Lasso has the property that some coefficients of β̂(y) are zero with positive probability. A minimizer of the Lasso objective function always exists, but it is not necessarily unique. Uniqueness of β̂(y) is guaranteed here by our assumption that the columns of X are in general position (Tibshirani 2013). This assumption is relatively mild; for example, if the entries of X are drawn from a (joint) distribution that has a Lebesgue density, then the columns of X are in general position with probability 1 (Tibshirani 2013). The model m̂(y) selected by the Lasso and the sign-vector ŝ(y) of nonzero Lasso coefficients can now formally be defined through
m̂(y) = {j: β̂j(y)≠0}  and  ŝ(y) = sign(β̂m̂(y)(y)),

(where ŝ(y) is left undefined if m̂(y)=∅). Recall that M denotes the set of all possible submodels and set Sm={−1,1}|m| for each m∈M. For later use, we also denote by M+ and Sm+ the collection of models and the collection of corresponding sign-vectors that occur with positive probability, that is,
M+ = {m∈M: Pμ,σ2(m̂(Y)=m)>0},  Sm+ = {s∈Sm: Pμ,σ2(m̂(Y)=m, ŝ(Y)=s)>0}  (m∈M).

These sets do not depend on μ and σ2, as the measure Pμ,σ2 and Lebesgue measure have the same null sets. Also, our assumption that the columns of X are in general position guarantees that M+ only contains models m for which Xm has full column rank |m| (Tibshirani 2013).
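As a concrete illustration of m̂(y) and ŝ(y), the following sketch fits the Lasso on a small simulated design and reads off the selected model and sign-vector. It assumes scikit-learn is available; the design, sample size, and penalty are hypothetical choices of ours. Note that scikit-learn's `alpha` corresponds to λ/n, since it scales the squared-error term by 1/(2n) rather than 1/2.

```python
# Illustrative sketch (not code from the paper): Lasso model selection,
# reading off the selected model mhat(y) and the sign vector shat(y).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 10.0
X = rng.standard_normal((n, p))      # Gaussian design: columns in general position a.s.
y = X[:, 0] + 0.1 * rng.standard_normal(n)

# sklearn minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1, so alpha = lam/n
# matches the objective (1/2)||y - Xb||^2 + lam*||b||_1 used in the text.
fit = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y)
beta_hat = fit.coef_
m_hat = np.flatnonzero(beta_hat)     # selected model: indices of nonzero coefficients
s_hat = np.sign(beta_hat[m_hat])     # sign vector of the nonzero coefficients
print(m_hat, s_hat)
```

The pair (m_hat, s_hat) is exactly the conditioning information used by the two intervals discussed in the text.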

Inference is focused on a nonstandard, model-dependent quantity of interest. Consider first the nontrivial case where m∈M+∖{∅}. In that case, we set
βm = Eμ,σ2[(Xm′Xm)−1Xm′Y] = (Xm′Xm)−1Xm′μ.

For γm∈R|m|∖{0}, the goal is to construct a confidence interval for γm′βm with conditional coverage probability 1−α on the event {m̂=m}. Clearly, the quantity of interest can also be written as γm′βm=ηm′μ for ηm=Xm(Xm′Xm)−1γm. For later use, write Pηm for the orthogonal projection on the space spanned by ηm. Finally, for the trivial case where m=∅, we set β∅=γ∅=η∅=0.

At the core of the polyhedral method lies the observation that the event where m̂=m and where ŝ=s describes a convex polytope in sample space Rn (up to a Lebesgue null set): Fix m∈M+∖{∅} and s∈Sm+. Then
{y: m̂(y)=m, ŝ(y)=s} =a.s. {y: Am,sy<bm,s};(2)
see Theorem 3.3 in Lee et al. (2016) (explicit formulas for the matrix Am,s and the vector bm,s are also repeated in Appendix C in our notation). Fix z∈Rn orthogonal to ηm. Then the set of y satisfying (In−Pηm)y=z and Am,sy<bm,s is either empty or a line segment. In either case, that set can be written as {z+ηmw: Vm,s−(z)<w<Vm,s+(z)}. The endpoints satisfy Vm,s−(z)≤Vm,s+(z) [see Lemma 4.1 of Lee et al. (2016); formulas for these quantities are also given in Appendix C in our notation]. Now decompose Y into the sum of two independent Gaussians PηmY and (In−Pηm)Y, where the first one is a linear function of ηm′Y∼N(ηm′μ, σ2ηm′ηm). With this, the conditional distribution of ηm′Y, conditional on the event {m̂(Y)=m, ŝ(Y)=s, (In−Pηm)Y=z}, is the conditional N(ηm′μ, σ2ηm′ηm)-distribution, conditional on the set (Vm,s−(z),Vm,s+(z)) (in the sense that the latter conditional distribution is a regular conditional distribution if one starts with the conditional distribution of ηm′Y given m̂=m and ŝ=s—which is always well-defined—and if one then conditions on the random variable (In−Pηm)Y).
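The line-segment endpoints can be computed directly from (Am,s, bm,s): substituting y = z + ηm·w into Am,sy < bm,s yields one linear inequality in w per row of Am,s. The helper below is a hypothetical reimplementation of this elementary algebra (names and interface are ours, not code from the paper), applied to a toy polyhedron.

```python
# Sketch: endpoints of the line segment {z + eta*w : lower < w < upper}
# = {y : A y < b, (I - P_eta) y = z}, obtained row by row from A y < b.
import numpy as np

def truncation_limits(A, b, eta, z):
    """Substituting y = z + eta*w into A y < b gives (A eta)_j * w < b_j - (A z)_j:
    a lower bound on w when (A eta)_j < 0, an upper bound when (A eta)_j > 0.
    Returns (lower, upper), or None if the line segment is empty."""
    Aeta = A @ eta
    resid = b - A @ z
    lower, upper = -np.inf, np.inf
    for aj, rj in zip(Aeta, resid):
        if aj < 0:
            lower = max(lower, rj / aj)
        elif aj > 0:
            upper = min(upper, rj / aj)
        elif rj <= 0:            # 0*w < rj fails for every w: empty segment
            return None
    return lower, upper

# Toy polyhedron {y in R^2 : y_1 > 1, y_2 < 3} with eta = e_1 and z = (0, 2)':
A = np.array([[-1.0, 0.0], [0.0, 1.0]])
b = np.array([-1.0, 3.0])
eta = np.array([1.0, 0.0])
z = np.array([0.0, 2.0])                 # orthogonal to eta
print(truncation_limits(A, b, eta, z))   # (1.0, inf): the segment requires w > 1
```

Here the segment is bounded from below but not from above, the situation that drives the results in Section 3.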

Fig. D1 For n = 2, the sample space R2 is partitioned corresponding to the model and the sign-vector selected by the Lasso when λ = 2 and X=(x1:x2), with x1=(1,0)′ and x2=(1,1)′. We set m={1,2} and γm=(1,1), so that ηm=x1. The point y lies on the black line segment {z+ηmv:v∈Tm(z)} for z=(I2−Pη)y, which is bounded on the left. In particular, Tm(z) is bounded. For the point y˜, the corresponding black line segments together are unbounded on both sides, and hence Tm((I2−Pηm)y˜) is unbounded.

To use these observations for the construction of confidence sets, consider first the conditional distribution of a random variable V∼N(θ,ς2) conditional on the event V∈T, where θ∈R, where ς2>0, and where T is the union of finitely many open intervals. The intervals may be unbounded. Write Fθ,ς2T(·) for the cumulative distribution function (cdf) of V given V∈T. The corresponding law can be viewed as a “truncated normal” distribution and will be denoted by TN(θ,ς2,T) in the following. We will construct a confidence interval based on W, where W∼TN(θ,ς2,T). Such an interval, which covers θ with probability 1−α, is obtained by the usual method of collecting all values θ0 for which a hypothesis test of H0: θ=θ0 against H1: θ≠θ0 does not reject, based on the observation W∼TN(θ,ς2,T). In particular, for w∈R, define L(w) and U(w) through
FL(w),ς2T(w) = 1−α/2  and  FU(w),ς2T(w) = α/2,
which are well-defined in view of Lemma A.2. With this, we have P(θ∈[L(W),U(W)]) = 1−α irrespective of θ∈R.
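For a single truncation interval T=(a,b), the endpoints L(w) and U(w) can be computed by scalar root-finding in θ, using that the truncated-normal cdf is decreasing in θ (Lemma A.2). The following is a minimal sketch under these assumptions (scipy-based; the bracket-expansion step and the single-interval restriction are simplifications of ours, not the authors' implementation).

```python
# Sketch: equal-tailed interval [L(w), U(w)] for the mean of a truncated
# normal, via scalar root-finding; assumes T is a single interval (a, b).
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def trunc_cdf(w, theta, sig, a, b):
    """cdf at w of N(theta, sig^2) conditioned on the interval (a, b)."""
    Fa, Fb = norm.cdf((a - theta) / sig), norm.cdf((b - theta) / sig)
    return (norm.cdf((w - theta) / sig) - Fa) / (Fb - Fa)

def conditional_interval(w, sig, a, b, alpha=0.05):
    """Collect all theta with trunc_cdf(w, theta) in [alpha/2, 1 - alpha/2]:
    the endpoints solve trunc_cdf(w, L) = 1 - alpha/2, trunc_cdf(w, U) = alpha/2."""
    def solve(target):
        lo, hi = w - sig, w + sig
        while trunc_cdf(w, lo, sig, a, b) < target:   # cdf is decreasing in theta,
            lo -= sig                                 # so expand the bracket outward
        while trunc_cdf(w, hi, sig, a, b) > target:
            hi += sig
        return brentq(lambda t: trunc_cdf(w, t, sig, a, b) - target, lo, hi)
    return solve(1 - alpha / 2), solve(alpha / 2)

L, U = conditional_interval(w=0.0, sig=1.0, a=-np.inf, b=np.inf)
print(L, U)   # recovers the usual two-sided normal interval when T = R
```

With a bounded truncation set, e.g. (a, b) = (−1, 1), the same routine returns a much longer interval, in line with the left panel of Figure 1.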

Fix m∈M+∖{∅} and s∈Sm+, and let σm2=σ2ηm′ηm and Tm,s(z)=(Vm,s−(z),Vm,s+(z)) for z orthogonal to ηm. With this, we have
ηm′Y | {m̂=m, ŝ=s, (In−Pηm)Y=z} ∼ TN(ηm′μ, σm2, Tm,s(z))(3)
for each z∈{(In−Pηm)y: Am,sy<bm,s}. Now define Lm,s(y) and Um,s(y) through
FLm,s(y),σm2Tm,s((In−Pηm)y)(ηm′y) = 1−α/2  and  FUm,s(y),σm2Tm,s((In−Pηm)y)(ηm′y) = α/2

for each y so that Am,sy<bm,s. By the considerations in the preceding paragraph, it follows that
Pμ,σ2(ηm′μ ∈ [Lm,s(Y),Um,s(Y)] | m̂=m, ŝ=s, (In−Pηm)Y=z) = 1−α.(4)

Clearly, the random interval [Lm,s(Y),Um,s(Y)] covers γm′βm=ηm′μ with probability 1−α also conditional on the event that m̂=m and ŝ=s or on the event that m̂=m.

In a similar fashion, fix m∈M+. In the nontrivial case where m≠∅, we set Tm(z)=∪s∈Sm+Tm,s(z) for z orthogonal to ηm, and define Lm(y) and Um(y) through
FLm(y),σm2Tm((In−Pηm)y)(ηm′y) = 1−α/2  and  FUm(y),σm2Tm((In−Pηm)y)(ηm′y) = α/2.

Arguing as in the preceding paragraph, we see that the random interval [Lm(Y),Um(Y)] covers γm′βm=ηm′μ with probability 1−α conditional on any of the events {m̂=m, (In−Pηm)Y=z} and {m̂=m}. In the trivial case where m=∅, we set [L∅(Y),U∅(Y)]={0} with probability 1−α and [L∅(Y),U∅(Y)]={1} with probability α, so that similar coverage properties also hold in that case. The unconditional coverage probability of the interval [Lm̂(Y),Um̂(Y)] then also equals 1−α.

Remark 1.

  1. If m˜=m˜(y) is any other model selection procedure such that the event {m˜=m} can be represented as the union of a finite number of polyhedra (up to null sets), then the polyhedral method can be applied to obtain a confidence set for ηm′μ with conditional coverage probability 1−α, conditional on the event {m˜=m}, if that event has positive probability.

  2. We focus here on equal-tailed confidence intervals for the sake of brevity. It is easy to adapt all our results to the unequal-tailed case, that is, the case where α/2 and 1−α/2 are replaced by α1 and 1−α2, with only minor modifications of the proofs, provided that α1 and α2 are both in (0,1/2]. (The remaining case, in which 1/2<α1+α2<1, is of little interest, because the corresponding coverage probability is 1−α1−α2<1/2 here, and is left as an exercise.) Another alternative, the uniformly most accurate unbiased interval, is discussed at the end of Section 5.

  3. In Theorem 3.3 of Lee et al. (2016), relation (2) is stated as an equality, not as an equality up to null sets, and with the right-hand side replaced by {y: Am,sy≤bm,s} (in our notation). Because (2) differs from this only on a Lebesgue null set, the difference is inconsequential for the purpose of the present article. The statement in Lee et al. (2016) is based on the fact that m̂ was defined as the equicorrelation set (Tibshirani 2013) in that article. But if m̂ is the equicorrelation set, then there can exist vectors y∈{m̂=m} such that some coefficients of β̂(y) are zero, which clashes with the idea that m̂ contains those variables whose Lasso coefficients are nonzero. However, for any m∈M+, the set of such y’s is a Lebesgue null set.

3 Analytical Results

3.1 Mean Confidence Interval Length

We first analyze the simple confidence set [L(W),U(W)] introduced in the preceding section, which covers θ with probability 1−α, where W∼TN(θ,ς2,T). By assumption, T is of the form T=∪i=1K(ai,bi), where K<∞ and a1<b1<⋯<aK<bK. Figure 1 exemplifies the length of [L(w),U(w)] when T is bounded (left panel) and when T is unbounded (right panel). The dashed line corresponds to the length of the standard (unconditional) confidence interval for θ based on V∼N(θ,ς2). In the left panel, we see that the length of [L(w),U(w)] diverges as w approaches the far left or the far right boundary point of the truncation set (i.e., −3 and 3). On the other hand, in the right panel we see that the length of [L(w),U(w)] is bounded and converges to the length of the standard interval as |w|→∞.

Fig. 1 Length of the interval [L(w),U(w)] for the case where T=(−3,−2)∪(−1,1)∪(2,3) (left panel) and the case where T=(−∞,−2)∪(−1,1)∪(2,∞) (right panel). In both cases, we took ς2=1 and α=0.05.

Write Φ(w) and ϕ(w) for the cdf and pdf of the standard normal distribution, respectively, where we adopt the usual convention that Φ(−∞)=0 and Φ(∞)=1.

Proposition 1

(The interval [L(W),U(W)] for truncated normal W). Let W∼TN(θ,ς2,T). If T is bounded either from above or from below, then
E[U(W)−L(W)] = ∞.

If T is unbounded from above and from below, then
(U(W)−L(W))/ς ≤a.s. 2Φ−1(1−pα/2) ∧ [2Φ−1(1−α/2) + (aK−b1)/ς],
where x∧y denotes min{x,y}, where p=infϑ∈R P(N(ϑ,ς2)∈T), and where aK−b1 is to be interpreted as 0 in case K = 1. [The first inequality trivially continues to hold if T is bounded, as then p=0.]

Intuitively, one expects confidence intervals to be wide if one conditions on a bounded set, because extreme values cannot be observed on a bounded set and the confidence intervals have to take this into account. We find that the conditional expected length is infinite in this case. If, for example, T is bounded from below, that is, if −∞<a1, then the first statement in the proposition follows from two facts: First, the length U(w)−L(w) behaves like 1/(w−a1) as w approaches a1 from above; and, second, the pdf of the truncated normal distribution at w is bounded away from 0 as w approaches a1 from above. See the proof in Appendix B for a more detailed version of this argument. On the other hand, if the truncation set is unbounded, extreme values are observable and confidence intervals, therefore, do not have to be extremely wide. The second upper bound provided by the proposition for that case will be useful later.
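The 1/(w−a1) behavior can be checked numerically. The sketch below is an illustration of ours under the assumed values T=(0,∞), ς=1, α=0.05 (not code from the paper); it evaluates the truncated-normal cdf in log-space via scipy's log_ndtr to avoid underflow for the extreme means involved, and the product w·(U(w)−L(w)) indeed stabilizes near log((2−α)/α)≈3.66 as w↓0.

```python
# Numerical illustration of Proposition 1: for T = (a1, inf) the length
# U(w) - L(w) blows up like 1/(w - a1) as w approaches a1 from above.
# Assumed setting: a1 = 0, varsigma = 1, alpha = 0.05.
import numpy as np
from scipy.special import log_ndtr
from scipy.optimize import brentq

A1, ALPHA = 0.0, 0.05

def trunc_cdf(w, theta):
    # cdf at w of N(theta, 1) given (A1, inf): 1 - sf(w-theta)/sf(A1-theta),
    # computed as 1 - exp(log_ndtr(theta-w) - log_ndtr(theta-A1)) for stability
    return -np.expm1(log_ndtr(theta - w) - log_ndtr(theta - A1))

def length(w):
    """U(w) - L(w), with L, U solving trunc_cdf = 1 - alpha/2 and alpha/2."""
    def solve(target):
        lo, hi = w - 1.0, w + 1.0
        while trunc_cdf(w, lo) < target:   # cdf decreases in theta: expand bracket
            lo -= 1.0
        while trunc_cdf(w, hi) > target:
            hi += 1.0
        return brentq(lambda t: trunc_cdf(w, t) - target, lo, hi)
    return solve(ALPHA / 2) - solve(1 - ALPHA / 2)

for w in (1e-1, 1e-2, 1e-3):
    print(w, length(w), w * length(w))     # the product settles near log(39) = 3.66
```

The divergence is driven entirely by the lower endpoint L(w), which moves to −∞ like −log((2−α)/α)/(w−a1); the upper endpoint stays comparatively tame.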

We see that the boundedness of the truncation set T is critical for the interval length. When the Lasso is used as a model selector, this prompts the question whether the truncation sets Tm,s(z) and Tm(z) are bounded or not, because the intervals [Lm,s(y),Um,s(y)] and [Lm(y),Um(y)] are obtained from conditional normal distributions with truncation sets Tm,s((In−Pηm)y) and Tm((In−Pηm)y), respectively. For m∈M+∖{∅}, s∈Sm+, and z orthogonal to ηm, recall that Tm,s(z)=(Vm,s−(z),Vm,s+(z)), and that Tm(z) is the union of these intervals over s∈Sm+. Write [ηm]⊥ for the orthogonal complement of the span of ηm.

Proposition 2

(The interval [Lm̂,ŝ(Y),Um̂,ŝ(Y)] for the Lasso). For each m∈M+∖{∅} and each s∈Sm, we have
∀z∈[ηm]⊥: −∞<Vm,s−(z),  or  ∀z∈[ηm]⊥: Vm,s+(z)<∞,  or both.

For the confidence interval [Lm̂,ŝ(Y),Um̂,ŝ(Y)], the statement in (1) now follows immediately: If m is a nonempty model and s is a sign-vector so that the event {m̂=m, ŝ=s} has positive probability, then m∈M+∖{∅} and s∈Sm+. Now Proposition 2 entails that Tm,s((In−Pηm)Y) is almost surely bounded from above or from below on the event {m̂=m, ŝ=s}, and Proposition 1 entails that (1) holds.

For the confidence interval [Lm̂(Y),Um̂(Y)], we obtain that its conditional expected length is finite, conditional on m̂=m with m∈M+∖{∅}, if and only if the corresponding truncation set Tm((In−Pηm)Y) is almost surely unbounded from above and from below on that event. More precisely, we have the following result.

Proposition 3

(The interval [Lm̂(Y),Um̂(Y)] for the Lasso). For m∈M+∖{∅}, we have
Eμ,σ2[Um̂(Y)−Lm̂(Y) | m̂=m] = ∞(5)

if and only if there exist s∈Sm+ and a vector y satisfying Am,sy<bm,s, so that
Tm((In−Pηm)y) is bounded from above or from below.(6)

To infer (5) from (6), the latter condition needs to be checked for every point y in a union of polyhedra. While this is easy in some simple examples like, say, the situation depicted in Lee et al. (2016), searching over polyhedra in Rn is hard in general. In practice, one can use a simpler sufficient condition that implies (5): After observing the data, that is, after observing a particular value y of Y, and hence also observing m̂(y)=m and ŝ(y)=s, we check whether Tm((In−Pηm)y) is bounded from above or from below (and also whether m≠∅ and whether Am,sy<bm,s, which, if satisfied, entails that m∈M+ and that s∈Sm+). If this is the case, then it follows, ex post, that (5) holds. Note that these computations occur naturally during the computation of [Lm(y),Um(y)] and can hence be performed as a safety precaution with little extra effort.
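Once the intervals (Vm,s−(z),Vm,s+(z)) for s∈Sm+ have been computed, the ex-post check just described reduces to asking whether their union is bounded from above or from below. A hypothetical helper for that last step (names and interface are ours, not from the paper):

```python
# Sketch of the ex-post check for infinite expected length of [L_m(Y), U_m(Y)].
import math

def infinite_length_certified(intervals):
    """intervals: list of (V-, V+) pairs, one per sign vector s in S_m^+.
    Returns True if their union T_m(z) is bounded from above or from below,
    which (together with m nonempty and A_{m,s} y < b_{m,s}) certifies, via
    Proposition 3, that the conditional expected length is infinite."""
    bounded_below = all(math.isfinite(lo) for lo, _ in intervals)  # finite min
    bounded_above = all(math.isfinite(hi) for _, hi in intervals)  # finite max
    return bounded_below or bounded_above

print(infinite_length_certified([(-math.inf, 0.0), (1.0, 2.0)]))      # True: bounded above
print(infinite_length_certified([(-math.inf, 0.0), (1.0, math.inf)])) # False
```

The union of finitely many intervals is bounded from below exactly when every lower endpoint is finite, and likewise from above, so the check is a pair of finiteness scans.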

The next result shows that the expected length of [Lm̂(Y),Um̂(Y)] is typically finite conditional on m̂=m if the selected model m is either extremely large or extremely small.

Corollary 1

(The interval [Lm̂(Y),Um̂(Y)] for the Lasso). If |m|=0 or |m|=1, we always have Eμ,σ2[Um̂(Y)−Lm̂(Y) | m̂=m]<∞; the same is true if |m|=p for Lebesgue-almost all γm (recall that [Lm̂(Y),Um̂(Y)] is meant to cover γm′βm conditional on m̂=m).

The corollary raises the suspicion that the conditional expected length of [Lm̂(Y),Um̂(Y)] could also be finite if the selected model m either includes almost no regressor (|m| close to zero) or excludes almost no regressor (|m| close to p). Our simulations seem to support this; see Section 4. The statement concerning Lebesgue-almost all γm does not necessarily hold for all γm; see Remark D.1. Also note that the case where |m̂|=p can only occur if p≤n, because the Lasso only selects models with no more than n variables here.

Remark 2.

We stress that property (5) or, equivalently, (6), only depends on the selected model m and on the regressor matrix X but not on the parameters μ and σ2 (which govern the distribution of Y). These parameters will, of course, impact the probability that the model m is selected in the first place. But conditional on m̂=m, they have no influence on whether or not the interval [Lm̂(Y),Um̂(Y)] has infinite expected length.

3.2 Quantiles of Confidence Interval Length

Both the intervals [Lm̂,ŝ(Y),Um̂,ŝ(Y)] and [Lm̂(Y),Um̂(Y)] are based on a confidence interval derived from the truncated normal distribution. We therefore first study the length of the latter through its quantiles and then discuss the implications of our findings for the intervals [Lm̂,ŝ(Y),Um̂,ŝ(Y)] and [Lm̂(Y),Um̂(Y)].

Consider W∼TN(θ,ς2,T) with T being the union of finitely many open intervals, and recall that [L(W),U(W)] covers θ with probability 1−α. Define qθ,ς2(κ) through
qθ,ς2(κ) = inf{x∈R: P(U(W)−L(W)≤x) ≥ κ}
for 0<κ<1; that is, qθ,ς2(κ) is the κ-quantile of the length of [L(W),U(W)]. If T is unbounded from above and from below, then U(W)−L(W) is bounded (almost surely) by Proposition 1; in this case, qθ,ς2(κ) is trivially bounded in κ. For the remaining case, that is, if T is bounded from above or from below, we have E[U(W)−L(W)]=∞ by Proposition 1, and the following result provides an approximate lower bound for the κ-quantile qθ,ς2(κ) for κ close to 1.

Proposition 4.

If b=sup T<∞, then
rθ,ς2(κ) = [ς log((2−α)/α)/(1−κ)] · ϕ((b−θ)/ς)/Φ((b−θ)/ς)
is an asymptotic lower bound for qθ,ς2(κ) in the sense that lim supκ→1 rθ,ς2(κ)/qθ,ς2(κ) ≤ 1. If a=inf T>−∞, then this statement continues to hold if, in the definition of rθ,ς2(κ), the last fraction is replaced by ϕ((a−θ)/ς)/(1−Φ((a−θ)/ς)).

We see that qθ,ς2(κ) goes to infinity at least as fast as O(1/(1−κ)) as κ approaches 1 if T is bounded. Moreover, if b=sup T<∞, then rθ,ς2(κ) goes to infinity as O(θ) as θ→∞ (cf. the end of the proof of Lemma A.4 in Appendix A), and a similar phenomenon occurs if a=inf T>−∞ and as θ→−∞. (In a model-selection context, the case where θ∉T often corresponds to the situation where the selected model is incorrect.) The approximation provided by Proposition 4 is visualized in Figure 2 for some specific scenarios.
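The bound rθ,ς2(κ) is cheap to evaluate. The sketch below codes it for the b=sup T<∞ case and evaluates it for the parameter values of Figure 2 (α=0.05, T=(−∞,0], ς²=1, θ∈{−2,−1,0}); the function name is ours, and the formula is as reconstructed above.

```python
# Sketch: asymptotic lower bound r_{theta, varsigma^2}(kappa) from
# Proposition 4, case b = sup T < infinity.
import numpy as np
from scipy.stats import norm

def r_lower_bound(kappa, theta, sig, b, alpha=0.05):
    """sig * log((2-alpha)/alpha)/(1-kappa) * phi((b-theta)/sig)/Phi((b-theta)/sig)."""
    x = (b - theta) / sig
    return sig * np.log((2 - alpha) / alpha) / (1 - kappa) * norm.pdf(x) / norm.cdf(x)

# The three curves of Figure 2 (T = (-inf, 0], varsigma^2 = 1) at kappa = 0.99:
for theta in (-2.0, -1.0, 0.0):
    print(theta, r_lower_bound(0.99, theta, 1.0, 0.0))
```

The values increase in θ here, matching the "starting from the bottom, θ=−2,−1,0" ordering of the curves in Figure 2, and each curve grows like 1/(1−κ) as κ→1.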

Fig. 2 Approximate lower bound for qθ,ς2(κ) from Proposition 4 for α=0.05, T=(−∞,0] and ς2=1. Starting from the bottom, the curves correspond to θ=−2,−1,0.

Proposition 4 also provides an approximation to the quantiles of Um̂,ŝ(Y)−Lm̂,ŝ(Y), conditional on the event {m̂=m, ŝ=s, (In−Pηm)Y=z} whenever m∈M+∖{∅} and s∈Sm+. Indeed, the corresponding κ-quantile is equal to qηm′μ,σm2,Tm,s(z)(κ) in view of (3) and by construction, and Proposition 4 provides an asymptotic lower bound. In a similar fashion, the κ-quantile of Um̂(Y)−Lm̂(Y) conditional on the event {m̂=m, (In−Pηm)Y=z} is given by qηm′μ,σm2,Tm(z)(κ) whenever m∈M+∖{∅} and (5) holds. Approximations to the quantiles of Um̂,ŝ(Y)−Lm̂,ŝ(Y) conditional on larger events like {m̂=m, ŝ=s} or {m̂=m} are possible but would involve integration over the range of z in the conditioning events; in other words, such approximations would depend on the particular geometry of the polyhedron {m̂=m, ŝ=s}⊆Rn; cf. (2). Similar considerations apply to the quantiles of Um̂(Y)−Lm̂(Y). However, comparing Figure 2 with the simulation results in Figure 4 of Section 4.2, we see that the behavior of rθ,ς2(κ) is also qualitatively similar to the behavior of unconditional κ-quantiles obtained through simulation, at least for κ close to 1.

Remark 3.

If m˜ is any other model selection procedure such that the event {m˜=m} is the union of a finite number of polyhedra (up to null sets), then the polyhedral method can be applied to obtain a confidence set for ηm′μ with conditional coverage probability 1−α, conditional on the event {m˜=m}, if that event has positive probability. In that case, Propositions 3 and 4 can be used to analyze the length of corresponding confidence intervals that are based on the polyhedral method: Clearly, for such a model selection procedure, an equivalence similar to (5)–(6) in Proposition 3 holds, with the Lasso-specific set Tm((In−Pηm)y) replaced by a similar set that depends on the event {m˜=m}. And conditional quantiles of confidence interval length are again of the form qθ,ς2(κ) for appropriate choices of θ, ς2, and T, for which Proposition 4 provides an approximate lower bound; cf. the discussion following the proposition. Examples include Fithian et al. (2015, sec. 5), Fithian, Sun, and Taylor (2017, sec. 4), or Reid, Taylor, and Tibshirani (2015, sec. 6). See also Tian, Loftus, and Taylor (2017, sec. 3.1) and Gross, Taylor, and Tibshirani (2015, sec. 5.1), where the truncated normal distribution is replaced by truncated t- and F-distributions, respectively.

3.3 The Unknown Variance Case

Suppose here that σ2>0 is unknown and that σ̂2 is an estimator for σ2. Fix m∈M+∖{∅} and s∈Sm+. Note that the set {y: Am,sy<bm,s} does not depend on σ2 and hence also Vm,s−((In−Pηm)y) and Vm,s+((In−Pηm)y) do not depend on σ2. For each ς2>0 and for each y so that Am,sy<bm,s, define Lm,s(y,ς2), Um,s(y,ς2), Lm(y,ς2), and Um(y,ς2) like Lm,s(y), Um,s(y), Lm(y), and Um(y) in Section 2, with ς2 replacing σ2 in the formulas. (Note that, say, Lm,s(y) depends on σ2 through σm2=σ2ηm′ηm.) The asymptotic coverage probability of the intervals [Lm,s(Y,σ̂2),Um,s(Y,σ̂2)] and [Lm(Y,σ̂2),Um(Y,σ̂2)], conditional on the events {m̂=m, ŝ=s} and {m̂=m}, respectively, was discussed in Lee et al. (2016).

If σ̂2 is independent of ηm′Y and positive with positive probability, then it is easy to see that (1) continues to hold with [Lm,s(Y,σ̂2),Um,s(Y,σ̂2)] replacing [Lm,s(Y),Um,s(Y)] for each m∈M+∖{∅} and each s∈Sm+. And if, in addition, σ̂2 has finite mean conditional on the event {m̂=m} for m∈M+, then it is elementary to verify that the equivalence (5)–(6) continues to hold with [Lm(Y,σ̂2),Um(Y,σ̂2)] replacing [Lm(Y),Um(Y)] (upon repeating the arguments following (5)–(6) and upon using the finite conditional mean of σ̂2 in the last step).

In the case where p < n, the usual variance estimator ‖Y−X(X′X)−1X′Y‖2/(n−p) is independent of ηm′Y, is positive with probability 1, and has finite unconditional (and hence also conditional) mean. For variance estimators in the case where p≥n, we refer to Lee et al. (2016) and the references therein.
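In the p < n case, the estimator above is just the full-model residual variance; a minimal sketch on simulated data (design, seed, and σ=2 are hypothetical choices of ours):

```python
# Sketch: the usual residual variance estimator ||Y - X(X'X)^{-1}X'Y||^2/(n-p),
# which is independent of eta'Y and has finite mean, as noted above.
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 200, 5, 2.0
X = rng.standard_normal((n, p))
y = X @ np.ones(p) + sigma * rng.standard_normal(n)

H = X @ np.linalg.solve(X.T @ X, X.T)            # hat matrix X(X'X)^{-1}X'
sigma2_hat = np.sum((y - H @ y) ** 2) / (n - p)  # residual sum of squares / df
print(sigma2_hat)                                # close to sigma^2 = 4
```

Since the residuals are orthogonal to the column space of X, and ηm lies in that column space, this estimator is independent of ηm′Y under the Gaussian model.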

4 Simulation Results

4.1 Mean of Um̂−Lm̂

We seek to investigate whether or not the expected length of [Lm̂,Um̂] is typically infinite, that is, to which extent the property of the interval [Lm̂,ŝ,Um̂,ŝ], as described in Proposition 2, carries over to [Lm̂,Um̂], which is characterized in Proposition 3. To this end, we perform an exploratory simulation exercise consisting of 500 repeated samples of size n = 100 for various configurations of p and λ, that is, for models with varying number of parameters p and for varying choices of the tuning parameter λ. The quantity of interest here is the first component of the parameter corresponding to the selected model. For each sample y∈Rn, we compute the Lasso estimator β̂(y), the selected model m̂(y), and the confidence interval [Lm̂(y),Um̂(y)] for β1m̂(y). Finally, we check whether |m̂(y)|>1 and whether the sufficient condition for infinite expected length outlined after Proposition 3 is satisfied. If so, the interval [Lm̂(Y),Um̂(Y)] is guaranteed to have infinite expected length conditional on the event m̂(Y)=m̂(y), irrespective of the true parameters in the model. The results, averaged over 500 repetitions for each configuration of p and λ, are reported in Figure 3.

Fig. 3 Heat-map showing the fraction of cases (out of 500 runs) in which we found a model m for which the confidence interval [Lm̂(Y),Um̂(Y)] for β1m̂(Y) is guaranteed to have infinite expected length conditional on m̂=m, for various values of p and λ. For those cases where infinite expected length is not guaranteed, the number in the corresponding cell shows the percentage of variables (out of p) in the smallest and in the largest selected model.


We see that the conditional expected length of [L_m̂(Y), U_m̂(Y)] is guaranteed to be infinite in a substantial number of cases (corresponding to the dark cells in the figure). The white cells correspond to cases where the sufficient condition for infinite expected length is not met. These correspond to simulation scenarios where either (a) p ≤ n and λ is quite small or (b) λ is quite large. In the first (resp. second) case, most regressors are included (resp. excluded) in the selected model with high probability.

A more detailed description of the simulation underlying Figure 3 is as follows: For each simulation scenario, that is, for each cell in the figure, we generate an n × p regressor matrix X whose rows are independent realizations of a p-variate Gaussian distribution with mean zero, whose covariance matrix has diagonal elements all equal to 1 and off-diagonal elements all equal to 0.2. Then we choose a vector β ∈ ℝᵖ so that the first p/2 components are equal to 1/n and the last p/2 components are equal to zero. Finally, we generate 500 n-vectors y_i = Xβ + u_i, where the u_i are independent draws from the N(0, I_n)-distribution, compute the Lasso estimators β̂(y_i) and the resulting selected models m_i = m̂(y_i). We then check whether |m_i| > 1 and whether the interval [L_{m_i}(y_i), U_{m_i}(y_i)] satisfies the sufficient condition outlined after Proposition 3 with η_{m_i} = X_{m_i}(X_{m_i}′X_{m_i})^{−1}e_1, where e_1 is the first canonical basis vector in ℝ^{|m_i|}. This corresponds to the quantity of interest being β₁^{m_i}, that is, the first component of the parameter corresponding to the selected model. If said condition is satisfied, the confidence set [L_m̂(Y), U_m̂(Y)] is guaranteed to have infinite expected length conditional on the event that m̂ = m_i (and hence also unconditionally). The fraction of indices i, 1 ≤ i ≤ 500, for which this is the case is displayed in the cells of Figure 3. If this fraction is below 100%, we report, in the corresponding cell, min|m_i|/p and max|m_i|/p, where the minimum and the maximum are taken over those cases i for which the sufficient condition is not met.
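The equicorrelated design just described can be generated via a Cholesky factor of the target covariance. The helper below is our own sketch (function name and seed are arbitrary choices, not from the article):

```python
import numpy as np

def equicorrelated_design(n, p, rho=0.2, rng=None):
    """n x p matrix with i.i.d. N(0, Sigma) rows, where Sigma has unit
    diagonal and all off-diagonal entries equal to rho."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))
    L = np.linalg.cholesky(sigma)      # Sigma = L L'
    return rng.standard_normal((n, p)) @ L.T

X = equicorrelated_design(100, 10, rng=np.random.default_rng(1))
print(X.shape)  # (100, 10)
```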

We stress here that the choice of β does not have an impact on whether or not a model m is such that the mean of U_m̂(Y) − L_m̂(Y) is finite conditional on m̂ = m. Indeed, the characterization in Proposition 3, as well as the sufficient condition that we check, does not depend on β. The choice of β does have an impact, however, on the probability that a given model m is selected in our simulations.

4.2 Quantiles of U_m̂ − L_m̂

We approximate the quantiles of U_m̂ − L_m̂ through simulation as follows: For n = 100, p = 14, and λ = 10, we choose β ∈ ℝᵖ proportional to (1,0,1,0,…,1,0) so that ‖β‖ ∈ {0, √(p/2)/10, √(p/2)}. For each choice of β, we generate an n-vector y as described in the preceding section, compute m = m̂(y) and the interval [L_m(y), U_m(y)] for β₁^m, and record its length. This is repeated 10,000 times. The resulting empirical quantiles are shown in Figure 4.
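Curves of the form (a+bκ)/(1−κ), as used for Figure 4, can be fitted by ordinary least squares after multiplying through by (1−κ); the snippet below sketches that transformation (that the authors fitted the curves exactly this way is an assumption on our part).

```python
import numpy as np

def fit_quantile_curve(kappa, q):
    """Fit q(kappa) ~ (a + b*kappa)/(1 - kappa) by least squares
    on the transformed linear model q*(1 - kappa) = a + b*kappa."""
    design = np.column_stack([np.ones_like(kappa), kappa])
    (a, b), *_ = np.linalg.lstsq(design, q * (1.0 - kappa), rcond=None)
    return a, b

kappa = np.linspace(0.5, 0.99, 50)
q = (1.0 + 2.0 * kappa) / (1.0 - kappa)  # synthetic quantiles with a=1, b=2
a, b = fit_quantile_curve(kappa, q)
print(a, b)  # recovers a=1, b=2
```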

Fig. 4 Simulated κ-quantiles. The black curves are functions of the form (a+bκ)/(1−κ), with a and b fitted by least squares. Starting from the bottom, the curves and the corresponding empirical quantiles correspond to ‖β‖ equal to 0, √(p/2)/10, and √(p/2).


Figure 4 suggests that the unconditional κ-quantiles also grow like 1/(1−κ) as κ approaches 1. This growth rate was already observed in Proposition 4 for conditional quantiles. Also, the unconditional κ-quantiles increase as ‖β‖ increases, which is again consistent with that proposition. Repeating this simulation for a range of other choices of p, β, and λ gave qualitatively similar results, which are not shown here for the sake of brevity. For these other choices, the corresponding κ-quantiles decrease as the probability of selecting either a very small model or an almost-full model increases, and vice versa. This is consistent with our findings from Corollary 1 and from Section 4.1.

5 Discussion

The polyhedral method can be used whenever the conditioning event of interest can be represented as a polyhedron. And our results can be applied whenever the polyhedral method is used for constructing confidence intervals. Besides the Lasso, this also includes other model selection methods as well as some recent proposals related to the polyhedral method that are mentioned in Remark 3.

By construction, the polyhedral method gives intervals like [L_{m̂,ŝ}, U_{m̂,ŝ}] and [L_m̂, U_m̂] that are derived from a confidence set based on a truncated univariate distribution (in our case, a truncated normal). Through this, the former intervals are rather easy to compute. And through this, the former intervals are valid conditional on quite small events, namely {m̂=m, ŝ=s, (I_n−P_{η_m})Y=z} and {m̂=m, (I_n−P_{η_m})Y=z}, respectively, which is a strong property; see Equation (4). But through this, the former intervals also inherit the property that their length can be quite large. This undesirable property is inherited through the conditioning on (I_n−P_{η_m})Y. Example 3 in Fithian, Sun, and Taylor (2017) demonstrates that requiring validity only on larger events, like {m̂=m, ŝ=s} or {m̂=m}, can result in much shorter intervals. But when conditioning on these larger events, the underlying reference distribution is no longer a univariate truncated distribution but an n-variate truncated distribution, and computations involving the corresponding n-variate cdf are much harder than those in the univariate case.

A recently proposed construction, selective inference with a randomized response, provides more powerful hypothesis tests conditional on the outcome of the model selection step, and hence also improved confidence sets based on these tests; cf. Tian and Taylor (2016). This increase in power is obtained by decreasing the “power” of the model selection step itself, in the sense that the model selector m̂(y) is replaced by m̂(y+ω), where ω represents additional randomization that is added to the data. Again, finite-sample computations are demanding in that setting compared to the simple polyhedral method (see Tian and Taylor 2016, sec. 4.2.2).

Another alternative construction, uniformly most accurate unbiased (UMAU) confidence intervals, should also be mentioned here. When the data-generating distribution belongs to an exponential family, UMAU intervals can be constructed conditional on events of interest like {m̂=m} or on smaller events like {m̂=m, (I_n−P_{η_m})Y=z}; cf. Fithian, Sun, and Taylor (2017). In either case, UMAU intervals require more involved computations than the equal-tailed intervals considered here.

Acknowledgments

We thank the associate editor and two referees, whose feedback has led to significant improvements of the article. Also, helpful input from Nicolai Amann is greatly appreciated.

Appendix A: Auxiliary Results

In this section, we collect some properties of functions like F^T_{θ,ς²}(w) that will be needed in the proofs of Propositions 1 and 2. The following result will be used repeatedly and is easily verified using L'Hospital's rule.

Lemma A.1.

For all a, b with a < b, the following holds:

lim_{θ→∞} Φ(a−θ)/Φ(b−θ) = 0.

Write F^T_{θ,ς²}(w) and f^T_{θ,ς²}(w) for the cdf and pdf of the TN(θ,ς²,T)-distribution, where T = ∪_{i=1}^K (a_i, b_i) with a_1 < b_1 < a_2 < … < a_K < b_K. For w ∈ T and for k so that a_k < w < b_k, we have

F^T_{θ,ς²}(w) = [Φ((w−θ)/ς) − Φ((a_k−θ)/ς) + Σ_{i=1}^{k−1} {Φ((b_i−θ)/ς) − Φ((a_i−θ)/ς)}] / Σ_{i=1}^{K} {Φ((b_i−θ)/ς) − Φ((a_i−θ)/ς)};

if k = 1, the sum in the numerator is to be interpreted as 0. And for w as above, the density f^T_{θ,ς²}(w) is equal to ϕ((w−θ)/ς)/ς divided by the denominator in the preceding display.
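The display above translates directly into code. The following is a minimal transcription of our own (with Φ computed via the complementary error function so the normal tails do not lose accuracy):

```python
import math

def Phi(x):
    """Standard normal cdf via erfc, which stays accurate in the lower tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def trunc_norm_cdf(w, theta, sig, T):
    """cdf at w of N(theta, sig^2) truncated to T = (a1,b1) u ... u (aK,bK),
    with a1 < b1 < a2 < ... < aK < bK and w in T, following the display above."""
    mass = lambda a, b: Phi((b - theta) / sig) - Phi((a - theta) / sig)
    denom = sum(mass(a, b) for a, b in T)
    num = 0.0
    for a, b in T:
        if w >= b:            # interval lies entirely below w
            num += mass(a, b)
        elif w > a:           # w falls inside this interval
            num += mass(a, w)
    return num / denom

T = [(-2.0, -1.0), (1.0, 3.0)]
print(trunc_norm_cdf(-1.5, 0.0, 1.0, T), trunc_norm_cdf(2.0, 0.0, 1.0, T))
```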

Lemma A.2.

For each fixed w ∈ T, F^T_{θ,ς²}(w) is continuous and strictly decreasing in θ, and

lim_{θ→−∞} F^T_{θ,ς²}(w) = 1 and lim_{θ→∞} F^T_{θ,ς²}(w) = 0.

Proof.

Continuity is obvious, and monotonicity has been shown in Lee et al. (2016) for the case where T is a single interval, that is, K = 1; it is easy to adapt that argument to also cover the case K > 1. Next consider the formula for F^T_{θ,ς²}(w). As θ → ∞, Lemma A.1 implies that the leading term in the numerator is Φ((w−θ)/ς), while the leading term in the denominator is Φ((b_K−θ)/ς). Using Lemma A.1 again gives lim_{θ→∞} F^T_{θ,ς²}(w) = 0. Finally, it is easy to see that F^T_{θ,ς²}(w) = 1 − F^{−T}_{−θ,ς²}(−w), where −T denotes the reflected truncation set {−t : t ∈ T} (upon using the relation Φ(t) = 1 − Φ(−t) and a little algebra). With this, we also obtain that lim_{θ→−∞} F^T_{θ,ς²}(w) = 1. □

For γ ∈ (0, 1) and w ∈ T, define Q_γ(w) through

F^T_{Q_γ(w),ς²}(w) = γ.

Lemma A.2 ensures that Q_γ(w) is well-defined. Note that L(w) = Q_{1−α/2}(w) and U(w) = Q_{α/2}(w).
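Since F^T_{θ,ς²}(w) is continuous and strictly decreasing in θ (Lemma A.2), Q_γ(w), and hence the endpoints L(w) and U(w), can be computed by bisection. The sketch below is our own illustration; the bracketing bounds are ad hoc and must contain the solution while staying within floating-point tail range.

```python
import math

def Phi(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def interval_mass(a, b, theta, sig):
    """P(a < N(theta, sig^2) < b), using whichever tail avoids cancellation."""
    ap, bp = (a - theta) / sig, (b - theta) / sig
    if ap + bp >= 0:  # interval mostly above theta: work with upper tails
        return Phi(-ap) - Phi(-bp)
    return Phi(bp) - Phi(ap)

def trunc_norm_cdf(w, theta, sig, T):
    denom = sum(interval_mass(a, b, theta, sig) for a, b in T)
    num = sum(interval_mass(a, min(w, b), theta, sig) for a, b in T if w > a)
    return num / denom

def Q(gamma, w, sig, T, lo=-30.0, hi=30.0, iters=100):
    """Solve F^T_{theta, sig^2}(w) = gamma for theta by bisection, using that
    the cdf is continuous and strictly decreasing in theta (Lemma A.2)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if trunc_norm_cdf(w, mid, sig, T) > gamma:
            lo = mid   # cdf still too large: move theta up
        else:
            hi = mid
    return 0.5 * (lo + hi)

theta = Q(0.025, 0.5, 1.0, [(0.0, 1.0)])
```

With α = 0.05, the pair (Q_{1−α/2}(w), Q_{α/2}(w)) then gives the interval [L(w), U(w)].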

Lemma A.3.

For fixed w ∈ T, Q_γ(w) is strictly decreasing in γ on (0, 1). And for fixed γ ∈ (0, 1), Q_γ(w) is continuous and strictly increasing in w ∈ T, with lim_{w→a_1} Q_γ(w) = −∞ and lim_{w→b_K} Q_γ(w) = ∞.

Proof.

Fix w ∈ T. Strict monotonicity of Q_γ(w) in γ follows from strict monotonicity of F^T_{θ,ς²}(w) in θ; cf. Lemma A.2.

Fix γ ∈ (0, 1) throughout the following. To show that Q_γ(·) is strictly increasing on T, fix w, w′ ∈ T with w < w′. We get

γ = F^T_{Q_γ(w),ς²}(w) < F^T_{Q_γ(w),ς²}(w′),

where the inequality holds because the density of F^T_{Q_γ(w),ς²}(·) is positive on T. The definition of Q_γ(w′) and Lemma A.2 entail that Q_γ(w) < Q_γ(w′).

To show that Q_γ(·) is continuous on T, we first note that F^T_{θ,ς²}(w) is continuous in (θ,w) ∈ ℝ×T (which is easy to see from the formula for F^T_{θ,ς²}(w) given after Lemma A.1). Now fix w ∈ T. Because Q_γ(·) is monotone, it suffices to show that Q_γ(w_n) → Q_γ(w) for any increasing sequence w_n in T converging to w from below, and for any decreasing sequence w_n in T converging to w from above. If the w_n increase toward w from below, the sequence Q_γ(w_n) is increasing and bounded by Q_γ(w) from above, so that Q_γ(w_n) converges to a finite limit Q̄. With this, and because F^T_{θ,ς²}(w) is continuous in (θ,w), it follows that

lim_{n→∞} F^T_{Q_γ(w_n),ς²}(w_n) = F^T_{Q̄,ς²}(w).

In the preceding display, the sequence on the left-hand side is constant equal to γ by definition of Q_γ(w_n), so that F^T_{Q̄,ς²}(w) = γ. It follows that Q̄ = Q_γ(w). If the w_n decrease toward w from above, a similar argument applies.

To show that lim_{w→b_K} Q_γ(w) = ∞, let w_n, n ≥ 1, be an increasing sequence in T that converges to b_K. It follows that Q_γ(w_n) converges to a (not necessarily finite) limit Q̄ as n → ∞. If Q̄ < ∞, we get for each b < b_K that

liminf_{n→∞} F^T_{Q_γ(w_n),ς²}(w_n) ≥ liminf_{n→∞} F^T_{Q_γ(w_n),ς²}(b) = F^T_{Q̄,ς²}(b).

In this display, the inequality holds because F^T_{Q_γ(w_n),ς²}(·) is a cdf, and the equality holds because F^T_{θ,ς²}(b) is continuous in θ. As this holds for each b < b_K, we obtain that liminf_{n→∞} F^T_{Q_γ(w_n),ς²}(w_n) = 1. But in this equality, the left-hand side equals γ, a contradiction. By similar arguments, it also follows that lim_{w→a_1} Q_γ(w) = −∞. □

Lemma A.4.

The function Q_γ(·) satisfies

lim_{w→b_K} (b_K − w) Q_γ(w) = −ς² log(γ) if b_K < ∞, and

lim_{w→a_1} (a_1 − w) Q_γ(w) = −ς² log(1−γ) if a_1 > −∞.

Proof.

As both statements follow from similar arguments, we only give the details for the first one. As w approaches b_K from below, Q_γ(w) converges to ∞ by Lemma A.3. This observation, the fact that F^T_{Q_γ(w),ς²}(w) = γ holds for each w, and Lemma A.1 together imply that

lim_{w→b_K} Φ((w−Q_γ(w))/ς) / Φ((b_K−Q_γ(w))/ς) = γ.

Because Φ(x)/(ϕ(x)/|x|) → 1 as x → −∞ (cf. Feller 1957, Lemma VII.1.2), we get that

lim_{w→b_K} ϕ((w−Q_γ(w))/ς) / ϕ((b_K−Q_γ(w))/ς) = γ.

The claim now follows by plugging-in the formula for ϕ(·) on the left-hand side, simplifying, and then taking the logarithm of both sides. □

Appendix B

Proof of Proposition 1

Proof of the first statement in Proposition 1.

Assume that b_K < ∞ (the case where a_1 > −∞ is treated similarly). Lemma A.4 entails that lim_{w→b_K} (b_K−w)(U(w)−L(w)) = ς²C, where C = log((1−α/2)/(α/2)) is positive. Hence, there exists a constant ϵ > 0 so that

U(w) − L(w) > (1/2) ς²C/(b_K − w)

whenever w ∈ (b_K−ϵ, b_K) ∩ T. Set B = inf{f^T_{θ,ς²}(w) : w ∈ (b_K−ϵ, b_K) ∩ T}. For w ∈ T, f^T_{θ,ς²}(w) is a Gaussian density divided by a constant scaling factor, so that B > 0. Because U(w) − L(w) ≥ 0 in view of Lemma A.3, we obtain that

E_{θ,ς²}[U(W) − L(W) | W ∈ T] ≥ (ς²BC/2) ∫_{(b_K−ϵ, b_K)∩T} (b_K−w)^{−1} dw = ∞. □

Proof of the first inequality in Proposition 1.

Define R_γ(w) through Φ((w−R_γ(w))/ς) = γ, that is, R_γ(w) = w − ςΦ^{−1}(γ). Then, on the one hand, we have

F^T_{R_γ(w),ς²}(w) = P(N(R_γ(w),ς²) ≤ w, N(R_γ(w),ς²) ∈ T) / P(N(R_γ(w),ς²) ∈ T) ≤ P(N(R_γ(w),ς²) ≤ w) / inf_ϑ P(N(ϑ,ς²) ∈ T) = γ/p,

while, on the other,

F^T_{R_γ(w),ς²}(w) ≥ [P(N(R_γ(w),ς²) ≤ w) − 1 + P(N(R_γ(w),ς²) ∈ T)] / P(N(R_γ(w),ς²) ∈ T) = 1 − P(N(R_γ(w),ς²) > w) / P(N(R_γ(w),ς²) ∈ T) ≥ 1 − (1−γ) / inf_ϑ P(N(ϑ,ς²) ∈ T) = (γ−1+p)/p.

The inequalities in the preceding two displays imply that

R_{1−p(1−γ)}(w) ≤ Q_γ(w) ≤ R_{pγ}(w).

(Indeed, the inequality in the third-to-last display continues to hold with pγ replacing γ; in that case, the upper bound reduces to γ. Similarly, the inequality in the second-to-last display continues to hold with 1−p(1−γ) replacing γ, in which case the lower bound reduces to γ. Now use the fact that F^T_{θ,ς²}(w) is decreasing in θ.) In particular, we get that U(w) = Q_{α/2}(w) ≤ R_{pα/2}(w) = w − ςΦ^{−1}(pα/2) and that L(w) = Q_{1−α/2}(w) ≥ R_{1−pα/2}(w) = w − ςΦ^{−1}(1−pα/2). The last two inequalities, and the symmetry of Φ(·) around zero, imply the first inequality in the proposition. □

Proof of the second inequality in Proposition 1.

Note that p ≥ p° := inf_ϑ P(N(ϑ,ς²) < b_1 or N(ϑ,ς²) > a_K), because T is unbounded above and below. Setting δ = (a_K − b_1)/(2ς), we note that δ ≥ 0, and it is elementary to verify that p° = 2Φ(−δ). Because Φ^{−1}(1−pα/2) ≤ Φ^{−1}(1−p°α/2), the inequality will follow if we can show that Φ^{−1}(1−p°α/2) ≤ Φ^{−1}(1−α/2) + δ or, equivalently, that Φ^{−1}(p°α/2) ≥ Φ^{−1}(α/2) − δ. Because Φ(·) is strictly increasing, this is equivalent to

p°α/2 = Φ(−δ)α ≥ Φ(Φ^{−1}(α/2) − δ).

To this end, we set f(δ) = αΦ(−δ)/Φ(Φ^{−1}(α/2)−δ) and show that f(δ) ≥ 1 for δ ≥ 0. Because f(0) = 1, it suffices to show that f′(δ) is nonnegative for δ > 0. The derivative can be written as a fraction with positive denominator and with numerator equal to

−αϕ(δ)Φ(Φ^{−1}(α/2)−δ) + αΦ(−δ)ϕ(Φ^{−1}(α/2)−δ).

The expression in the preceding display is nonnegative if and only if

Φ(−δ)/ϕ(δ) ≥ Φ(Φ^{−1}(α/2)−δ)/ϕ(Φ^{−1}(α/2)−δ).

This will follow if the function g(x) = Φ(−x)/ϕ(x) is decreasing in x ≥ 0, because the right-hand side of the preceding display equals g(y) for y = δ − Φ^{−1}(α/2) > δ. The derivative g′(x) can be written as a fraction with positive denominator and with numerator equal to

−ϕ(x)² + xΦ(−x)ϕ(x) = xϕ(x)(Φ(−x) − ϕ(x)/x).

Using the well-known inequality Φ(−x) ≤ ϕ(x)/x for x > 0 (Feller 1957, Lemma VII.1.2), we see that the expression in the preceding display is nonpositive for x > 0. □
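The identity p° = 2Φ(−δ) used above is easy to check numerically: for fixed b₁ < a_K, the map ϑ ↦ P(N(ϑ,ς²) < b₁) + P(N(ϑ,ς²) > a_K) is minimized at the midpoint (b₁+a_K)/2. A small grid check (the numerical values are chosen for illustration only):

```python
import math

def Phi(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def outside_prob(theta, sig, b1, aK):
    """P(N(theta, sig^2) < b1) + P(N(theta, sig^2) > aK), for b1 <= aK."""
    return Phi((b1 - theta) / sig) + 1.0 - Phi((aK - theta) / sig)

b1, aK, sig = -0.5, 1.5, 1.0
grid = [b1 + (aK - b1) * i / 10000 for i in range(10001)]
p_min = min(outside_prob(t, sig, b1, aK) for t in grid)
delta = (aK - b1) / (2.0 * sig)
print(p_min, 2.0 * Phi(-delta))  # the two values agree
```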

Appendix C

Proof of Proposition 2

From Lee et al. (2016), we recall the formulas for the expressions on the right-hand side of (2), namely A_{m,s} = (A⁰_{m,s}′, A¹_{m,s}′)′ and b_{m,s} = (b⁰_{m,s}′, b¹_{m,s}′)′, with A⁰_{m,s} and b⁰_{m,s} given by

(1/λ) (X_{mᶜ}′(I_n − P_{X_m}); −X_{mᶜ}′(I_n − P_{X_m})) and (ι − X_{mᶜ}′X_m(X_m′X_m)^{−1}s; ι + X_{mᶜ}′X_m(X_m′X_m)^{−1}s),

respectively (the semicolon separates the two blocks of rows), and with A¹_{m,s} = −diag(s)(X_m′X_m)^{−1}X_m′ and b¹_{m,s} = −λ diag(s)(X_m′X_m)^{−1}s (in the preceding display, P_{X_m} denotes the orthogonal projection matrix onto the column space spanned by X_m, and ι denotes an appropriate vector of ones). Moreover, it is easy to see that the set {y : A_{m,s}y < b_{m,s}} can be written as {y : for z = (I_n − P_{η_m})y, we have V⁻_{m,s}(z) < η_m′y < V⁺_{m,s}(z) and V⁰_{m,s}(z) > 0}, where

V⁻_{m,s}(z) = max({(b_{m,s} − A_{m,s}z)_i/(A_{m,s}c_m)_i : (A_{m,s}c_m)_i < 0} ∪ {−∞}),
V⁺_{m,s}(z) = min({(b_{m,s} − A_{m,s}z)_i/(A_{m,s}c_m)_i : (A_{m,s}c_m)_i > 0} ∪ {∞}),
V⁰_{m,s}(z) = min({(b_{m,s} − A_{m,s}z)_i : (A_{m,s}c_m)_i = 0} ∪ {∞})

with c_m = η_m/‖η_m‖²; cf. also Lee et al. (2016).
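The quantities V⁻, V⁺, and V⁰ are straightforward to compute from (A_{m,s}, b_{m,s}, η_m). The sketch below is our own generic implementation of the displayed formulas (names are hypothetical, and the exact zero test on (Ac)_i is for illustration only):

```python
import numpy as np

def truncation_limits(A, b, eta, y):
    """Endpoints V-, V+, V0 of Lee et al. (2016) at z = (I - P_eta) y:
    the polyhedron {Ay < b} constrains eta'y to the interval (V-, V+),
    with the side condition V0 > 0 on rows not involving eta'y."""
    c = eta / (eta @ eta)        # c = eta / ||eta||^2
    z = y - c * (eta @ y)        # z = (I - P_eta) y
    Ac = A @ c
    resid = b - A @ z
    lower = [r / a for r, a in zip(resid, Ac) if a < 0]
    upper = [r / a for r, a in zip(resid, Ac) if a > 0]
    zero = [r for r, a in zip(resid, Ac) if a == 0]
    return (float(max(lower, default=-np.inf)),
            float(min(upper, default=np.inf)),
            float(min(zero, default=np.inf)))

A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
b = np.array([1.0, 1.0, 1.0])
eta = np.array([1.0, 0.0])
y = np.array([0.2, 0.3])         # satisfies Ay < b
print(truncation_limits(A, b, eta, y))  # (-1.0, 1.0, 0.7)
```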

Proof of Proposition 2.

Set I⁻ = {i : (A_{m,s}c_m)_i < 0} and I⁺ = {i : (A_{m,s}c_m)_i > 0}. In view of the formulas for V⁻_{m,s}(z) and V⁺_{m,s}(z) given earlier, it suffices to show that either I⁻ or I⁺ is nonempty. Arguing by contradiction, assume that I⁻ = I⁺ = ∅. Then A_{m,s}c_m = 0 and hence also A¹_{m,s}c_m = 0. Using the explicit formula for A¹_{m,s} and the definition of η_m, that is, η_m = X_m(X_m′X_m)^{−1}γ_m, it follows that γ_m = 0, which contradicts our assumption that γ_m ∈ ℝ^{|m|}∖{0}. □

Appendix D

Proof of Proposition 3 and Corollary 1

As a preparatory consideration, recall that T_m((I_n−P_{η_m})y) is the union of the intervals (V⁻_{m,s}((I_n−P_{η_m})y), V⁺_{m,s}((I_n−P_{η_m})y)) with s ∈ S⁺_m. Inspection of the explicit formulas for the interval endpoints given in Appendix C now immediately reveals the following: The lower endpoint V⁻_{m,s}((I_n−P_{η_m})y) is either constant equal to −∞ on the set {y : A_{m,s}y < b_{m,s}}, or it is the maximum of a finite number of linear functions of y (and hence finite and continuous) on that set. Similarly, the upper endpoint V⁺_{m,s}((I_n−P_{η_m})y) is either constant equal to ∞ on that set, or it is the minimum of a finite number of linear functions of y (and hence finite and continuous) on that set.

Proof of Proposition 3.

Let m ∈ M⁺∖{∅}. We first assume, for some s and y with s ∈ S⁺_m and A_{m,s}y < b_{m,s}, that the set in (6) is bounded from above (the case of boundedness from below is similar). Then there is an open neighborhood O of y so that each point w ∈ O satisfies A_{m,s}w < b_{m,s} and also so that T_m((I_n−P_{η_m})w) is bounded from above. Because O has positive Lebesgue measure, (5) now follows from Proposition 1. To prove the converse, assume for each s ∈ S⁺_m and each y satisfying A_{m,s}y < b_{m,s} that T_m((I_n−P_{η_m})y) is unbounded from above and from below. Because the sets {y : A_{m,s}y < b_{m,s}} for s ∈ S⁺_m are disjoint by construction, the same is true for the sets T_{m,s}((I_n−P_{η_m})y) for s ∈ S⁺_m. Using Proposition 1, we then obtain that U_m̂(Y) − L_m̂(Y) is bounded by a linear function of

max{V⁻_{m,s}((I_n−P_{η_m})Y) : s ∈ S⁺_m} − min{V⁺_{m,s}((I_n−P_{η_m})Y) : s ∈ S⁺_m}

Lebesgue-almost everywhere on the event {m̂=m}. (The maximum and the minimum in the preceding display correspond to aK and b1, respectively, in Proposition 1.) It remains to show that the expression in the preceding display has finite conditional expectation on the event {m̂=m}. But this expression is the maximum of a finite number of Gaussians minus the minimum of a finite number of Gaussians. Its unconditional expectation, and hence also its conditional expectation on the event {m̂=m}, is finite. □

Proof of Corollary 1.

The statement for |m| = 0 is trivial. Next, consider the case where |m| = 1. Take s ∈ S⁺_m and y so that A_{m,s}y < b_{m,s}. We need to show that T_m(z) = T_{m,1}(z) ∪ T_{m,−1}(z) is unbounded above and below for z = (I_n−P_{η_m})y. To this end, first recall the formulas presented at the beginning of Appendix C. Together with the fact that, here, η_m = X_mγ_m/‖X_m‖² ≠ 0, these formulas entail that A⁰_{m,1}c_m = A⁰_{m,−1}c_m = 0 and that A¹_{m,1}c_m = −A¹_{m,−1}c_m ≠ 0. With this, and in view of the definitions of V⁻_{m,s}(z), V⁺_{m,s}(z), and V⁰_{m,s}(z) in Appendix C, it follows that T_m(z) is a set of the form (−∞, a) ∪ (a′, ∞), which is unbounded.

Finally, assume that |m| = p ≤ n. Fix s ∈ S⁺_m and y so that A_{m,s}y < b_{m,s}, and set z = (I_n−P_{η_m})y. Again, we need to show that T_m(z) = ∪_{s̃∈S⁺_m} T_{m,s̃}(z) is unbounded above and below. For each s̃ ∈ S⁺_m, it is easy to see that A⁰_{m,s̃}c_m = 0 and that b⁰_{m,s̃} is a vector of ones. The condition A_{m,s̃}y < b_{m,s̃} hence reduces to A¹_{m,s̃}y < b¹_{m,s̃}. Note that A¹_{m,s̃}c_m = −diag(s̃)(X_m′X_m)^{−1}γ_m / (γ_m′(X_m′X_m)^{−1}γ_m), and that the set of its zero components does not depend on s̃. We henceforth assume that γ_m is such that all components of A¹_{m,s}c_m are nonzero, which is satisfied for Lebesgue-almost all vectors γ_m. Now choose sign-vectors s⁺ and s⁻ in {−1,1}ᵖ as follows: Set s⁺_i = −s_i if (A¹_{m,s}c_m)_i < 0; otherwise, set s⁺_i = s_i. With this, we get that A_{m,s⁺}c_m is a nonzero vector with positive components. Choose s⁻ in a similar fashion, so that A_{m,s⁻}c_m is a nonzero vector with negative components. It follows that T_{m,s⁺}(z) ∪ T_{m,s⁻}(z) is a set of the form (−∞, a) ∪ (a′, ∞). We next show that s⁺ and s⁻ lie in S⁺_m. Choose y⁺ so that (I_n−P_{η_m})y⁺ = z and so that η_m′y⁺ ∈ T_{m,s⁺}(z). Because V⁰_{m,s⁺}(z) = ∞ by construction, it follows that A_{m,s⁺}y⁺ < b_{m,s⁺} and hence m̂(y⁺) = m and ŝ(y⁺) = s⁺. Because the same is true for all points in a sufficiently small open ball around y⁺, the event {m̂=m, ŝ=s⁺} has positive probability and hence s⁺ ∈ S⁺_m. A similar argument entails that s⁻ ∈ S⁺_m. Taken together, we see that T_{m,s⁺}(z) ∪ T_{m,s⁻}(z) ⊆ ∪_{s̃∈S⁺_m} T_{m,s̃}(z) = T_m(z), so that the last set is indeed unbounded above and below. □

Remark D.1.

The statement in Corollary 1 for the case |m| = p ≤ n does not hold for all γ_m or, equivalently, for all η_m. For example, if γ_m is such that η_m is parallel to one of the columns of X, then some components of A¹_{m,s}c_m are zero and T_m((I_n−P_{η_m})y) can be bounded for some y. Figure D1 illustrates the situation.

Appendix E

Proof of Proposition 4

We only consider the case where b = sup T < ∞; the case where a = inf T > −∞ is treated similarly. The proof relies on the observation that

lim_{w→b} (U(w) − L(w))/A(w) = 1 for A(w) = ς² log((2−α)/α)/(b−w)

in view of Lemma A.4. The quantiles of A(W) are easy to compute: If s_{θ,ς²}(κ) denotes the κ-quantile of A(W), then

s_{θ,ς²}(κ) = ς² log((2−α)/α) / (b − (F^T_{θ,ς²})^{−1}(κ)).

The denominator in the preceding display, which involves the quantile function (F^T_{θ,ς²})^{−1}(·), can be approximated as follows.

Lemma E.1.

For κ → 1, we have

b − (F^T_{θ,ς²})^{−1}(κ) = (1−κ) ςP(V ∈ T)/ϕ((b−θ)/ς) (1+o(1)),

where V ∼ N(θ,ς²).

Proof.

With the convention that (F^T_{θ,ς²})^{−1}(1) = b and as κ → 1, we have

b − (F^T_{θ,ς²})^{−1}(κ) = (F^T_{θ,ς²})^{−1}(1) − (F^T_{θ,ς²})^{−1}(κ)
= (1−κ) [(F^T_{θ,ς²})^{−1}(1) − (F^T_{θ,ς²})^{−1}(1−(1−κ))]/(1−κ)
= (1−κ) (((F^T_{θ,ς²})^{−1})′(1) + o(1))
= (1−κ) ((F^T_{θ,ς²})^{−1})′(1) (1+o(1))
= (1−κ) (1/(F^T_{θ,ς²})′(b)) (1+o(1))
= (1−κ) ςP(V ∈ T)/ϕ((b−θ)/ς) (1+o(1)),

where the second-to-last equality relies on the inverse function theorem and the last equality holds because F^T_{θ,ς²}(w) = P(V ≤ w | V ∈ T). □

Lemma E.2.

The κ-quantiles of A(W) provide an asymptotic lower bound for the κ-quantiles of the length U(W) − L(W), in the sense that limsup_{κ→1} s_{θ,ς²}(κ)/q_{θ,ς²}(κ) ≤ 1.

Proof.

Fix ϵ ∈ (0, 1) and choose δ > 0 so that (1−ϵ)A(w) ≤ U(w) − L(w) whenever w ∈ (b−δ, b). In addition, we may assume that δ is sufficiently small so that the cdf of W is strictly increasing on (b−δ, b). Using the formula for A(W), we get that

{A(W) > x} = {W > b − ς² log((2−α)/α)/x}

for x > 0. By Lemma E.1, s_{θ,ς²}(κ) converges to infinity as κ → 1. Hence, we have ς² log((2−α)/α)/s_{θ,ς²}(κ) < δ/2 for κ sufficiently close to 1, say, κ ∈ (κ₀, 1). For each ρ ∈ (1/2, 1) and κ ∈ (κ₀, 1), we obtain that

{A(W) > ρ s_{θ,ς²}(κ)} = {A(W) > ρ s_{θ,ς²}(κ), W > b−δ} ⊆ {U(W) − L(W) > (1−ϵ)ρ s_{θ,ς²}(κ)},

which entails that

P(U(W) − L(W) ≤ (1−ϵ)ρ s_{θ,ς²}(κ)) < κ,

because ρ s_{θ,ς²}(κ) < s_{θ,ς²}(κ) and the cdf of W is strictly increasing between the corresponding thresholds b − ς² log((2−α)/α)/(ρ s_{θ,ς²}(κ)) and b − ς² log((2−α)/α)/s_{θ,ς²}(κ). It follows that (1−ϵ)ρ s_{θ,ς²}(κ) ≤ q_{θ,ς²}(κ) whenever ρ ∈ (1/2, 1) and κ ∈ (κ₀, 1). Letting ρ go to 1 gives (1−ϵ) s_{θ,ς²}(κ) ≤ q_{θ,ς²}(κ) whenever κ ∈ (κ₀, 1). Hence, limsup_{κ→1} (1−ϵ) s_{θ,ς²}(κ)/q_{θ,ς²}(κ) ≤ 1. Since ϵ can be chosen arbitrarily close to zero, this completes the proof. □

Proof of Proposition 4.

Use the formula for s_{θ,ς²}(κ) and Lemma E.1 to obtain that

s_{θ,ς²}(κ)/q_{θ,ς²}(κ) = (1/q_{θ,ς²}(κ)) (ς log((2−α)/α)/(1−κ)) (ϕ((b−θ)/ς)/P(V ∈ T)) (1+o(1))
≥ (1/q_{θ,ς²}(κ)) (ς log((2−α)/α)/(1−κ)) (ϕ((b−θ)/ς)/Φ((b−θ)/ς)) (1+o(1))
= (r_{θ,ς²}(κ)/q_{θ,ς²}(κ)) (1+o(1))

as κ → 1 (the inequality holds because T ⊆ (−∞, b)). The claim now follows from this and Lemma E.2. □

Remark.

The argument presented here can be extended to obtain an explicit formula for the exact rate of q_{θ,ς²}(κ) as κ → 1 or as θ → ∞ (or both). The resulting expression is more involved (the cases where T is bounded from one side and from both sides need separate treatment) but qualitatively similar to r_{θ,ς²}(κ), as far as its behavior for κ → 1 or θ → ∞ is concerned. In view of this, and for the sake of brevity, results for the exact rate are not presented here.