
1. INTRODUCTION

We are grateful for the opportunity to discuss this new test, based on marginal screening, of a global null hypothesis in linear models. Marginal screening has become a very popular tool for reducing dimensionality in recent years, and a great deal of work has focused on its variable selection properties (e.g., Fan and Lv 2008; Fan, Samworth, and Wu 2009). Corresponding inference procedures are much less well developed, and one of the interesting contributions of this article is the observation that the limiting distribution (here and throughout, we use the same notation as in the article) of n^{1/2}(θ̂_n − θ_0) is discontinuous at θ_0 = 0. Such nonregular limiting distributions are well known to cause difficulties for the bootstrap (e.g., Beran 1997; Samworth 2003). Although in some settings these issues are an artefact of the pointwise asymptotics usually invoked to justify the bootstrap (Samworth 2005), there are other settings where some modification of standard bootstrap procedures is required. Two such examples are bootstrapping Lasso estimators (Chatterjee and Lahiri 2011) and certain classification problems (Laber and Murphy 2011), in which thresholded versions of the obvious estimators are bootstrapped, in a fashion analogous to the approach of this article.

2. STANDARDIZED OR UNSTANDARDIZED PREDICTORS?

Theorem 1 of the article reveals that the limiting distribution of n^{1/2}(θ̂_n − θ_0) may be quite complicated, even under the global null. To see this, consider a setting where p = 2, where X_1 and X_2 are highly correlated, but var(X_1) ≠ var(X_2). In this case, it is essentially a coin toss as to which predictor has the greater sample correlation with Y, but if k̂_n = 1 then |θ̂_n| will tend to be large, while if k̂_n = 2 then |θ̂_n| will tend to be small. The unfortunate consequence for the power of the procedure is that even for large sample sizes, we will only have a reasonable chance of rejecting the global null if we select X_1 (in particular, the power will be not much greater than 50% even when the signal is relatively large). For instance, consider the situation where n = 100, p = 2, X_1 ~ N(0, 1), X_2 = 20X_1 + η, where η ~ N(0, 1) is independent of X_1, and

    Y = X_1 + ε,   (2.1)

where ε ~ N(0, 1) is independent of X_1 (and η). Instead of using the adaptive resampling test (ART) to obtain the critical value for the test of size α = 0.05, we simply simulated from the null model where (X_1, X_2) are as above, but Y = ε ~ N(0, 1). A density plot of the values of θ̂_n computed over 10,000 repetitions is shown in the top-left panel of Figure 1; note that the spike around 0 is due mainly to the 5017 occasions where X_2 happened to have the higher absolute correlation with Y (i.e., k̂_n = 2). The critical value for the test was taken to be the 100(1 − α)th quantile of the realizations of |θ̂_n|, namely 0.171. Under the alternative specified by (2.1), θ̂_n has a highly bimodal distribution, as illustrated in the bottom-left panel of Figure 1. The only occasions on which we were able to reject the null were those where X_1 had the higher absolute correlation with Y, yielding a power of 59.8%.
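The phenomenon above is easy to reproduce. The following sketch is our own illustration, not the authors' code: we take θ̂_n to be the marginal least-squares slope (without intercept, for simplicity) of Y on the selected predictor, which is our reading of the article's estimator, and all function names are ours. We use 2000 Monte Carlo repetitions rather than the 10,000 in the text to keep the sketch quick.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sim, alpha = 100, 2000, 0.05

def theta_hat(X, y):
    """Marginal screening: select the predictor with the larger absolute
    sample correlation with y, then return the least-squares slope of y
    on that predictor alone (no intercept, for simplicity)."""
    cors = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    k = int(np.argmax(cors))
    return np.sum(X[:, k] * y) / np.sum(X[:, k] ** 2)

def draw_X():
    x1 = rng.standard_normal(n)
    x2 = 20 * x1 + rng.standard_normal(n)  # highly correlated, very different variances
    return np.column_stack([x1, x2])

# Calibrate under the null model, where Y is independent of (X1, X2)
null_stats = [abs(theta_hat(draw_X(), rng.standard_normal(n))) for _ in range(n_sim)]
crit = np.quantile(null_stats, 1 - alpha)

# Estimate power under the alternative Y = X1 + eps
rejections = 0
for _ in range(n_sim):
    X = draw_X()
    y = X[:, 0] + rng.standard_normal(n)
    rejections += abs(theta_hat(X, y)) > crit
power = rejections / n_sim
print(f"critical value ~ {crit:.3f}, power ~ {power:.2f}")
```

With these settings the critical value comes out near the 0.171 reported in the text, and the estimated power sits near 60%: the test rejects essentially only when X_1 wins the coin toss.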

Fortunately, it is straightforward to construct a slightly modified test statistic that can yield great improvements. Indeed, it is standard practice in variable selection contexts to standardize each predictor X_k so that Ê(X_k) = 0 and v̂ar(X_k) = 1, and likewise to standardize the response so that Ê(Y) = 0 and v̂ar(Y) = 1. This amounts to using the test statistic |θ̃_n|, where

    θ̃_n = Ĉorr(X_{k̂_n}, Y).

Note that the value of θ̃_n does not depend on whether the predictors and the response have been standardized, and that we have the simple expression

    |θ̃_n| = max_{j=1,...,p} |Ĉorr(X_j, Y)|.

For the example above, the top-right panel of Figure 1 gives a density plot of θ̃_n under the null; the critical value for our modified test was 0.198. Under the alternative, θ̃_n tends to be inflated regardless of whether k̂_n = 1 or k̂_n = 2; in fact, we obtain an empirical power of 100%.
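In code, the modified statistic is simply the largest absolute sample correlation between a predictor and the response; the helper below (our own, with our own name for it) also demonstrates the scale invariance just noted: rescaling a predictor changes the marginal slope but leaves θ̃_n unchanged.

```python
import numpy as np

def theta_tilde(X, y):
    """Largest absolute sample correlation between a column of X and y.
    Invariant to rescaling of the predictors and of the response."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cors = Xc.T @ yc / (np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.max(np.abs(cors))

# Rescaling a predictor leaves theta_tilde unchanged
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = X[:, 0] + rng.standard_normal(100)
X_rescaled = X * np.array([1.0, 20.0])
print(theta_tilde(X, y), theta_tilde(X_rescaled, y))  # identical values
```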

Figure 1 Top row: density plots of θ̂_n (left) and θ̃_n (right) under the global null hypothesis for the example in Section 2. Bottom row: corresponding density plots of θ̂_n (left) and θ̃_n (right) under the alternative specified in (2.1).

We emphasize that the problems described in this section are not observed in the simulation study of the article, because there all of the predictors have equal variance. In the next section, we take the predictors and the response to be standardized as above, and consider alternative approaches to calibrating the test statistic n^{1/2}|θ̃_n|, as well as another test statistic proposed by Goeman, van de Geer, and van Houwelingen (2006).

3. ALTERNATIVE APPROACHES

Although the nonregularities in the problem considered here make the construction of a confidence interval for θ_0 a challenging task, the particularly simple form of the global null hypothesis makes the testing problem amenable to several other approaches. Under the global null, X and Y are independent, so by the central limit theorem,

    n^{1/2} (Ĉorr(X_1, Y), ..., Ĉorr(X_p, Y))^T →_d N_p(0, Θ)

as n → ∞, where Θ_{jk} = Corr(X_j, X_k). Then, by the continuous mapping theorem,

    n^{1/2} |θ̃_n| →_d max_{j=1,...,p} |Z_j|,

where (Z_1, ..., Z_p)^T ~ N_p(0, Θ). Since the distribution on the right does not depend on the distribution of Y, we can calibrate the test by simulating n^{1/2}|θ̃_n| with the distribution of Y taken to be (a) the empirical measure of the data Y_1, ..., Y_n, or (b) N(0, 1), for example. Figures 2 and 3 display the results of using these approaches in the numerical experiments of Section 4.1 of the article. Method (a) appears to yield a test with size not exceeding its nominal level and with power similar to that of the ART procedure. When the error distribution is normal, the size of the test from method (b) is exactly equal to the nominal level, up to Monte Carlo error; again, the power is similar to that of ART.
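Method (b) can be sketched as follows. This is our own illustration under the stated assumptions (Y redrawn from N(0, 1) independently of X, which is valid under the global null); the function name and defaults are ours.

```python
import numpy as np

def calibrate_b(X, alpha=0.05, n_sim=2000, rng=None):
    """Method (b): approximate the null distribution of n^{1/2}|theta_tilde|
    by redrawing Y ~ N(0, 1) independently of X, and return the
    100(1 - alpha)th percentile as the critical value."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    norms = np.sqrt((Xc ** 2).sum(axis=0))
    stats = np.empty(n_sim)
    for b in range(n_sim):
        y = rng.standard_normal(n)
        yc = y - y.mean()
        cors = Xc.T @ yc / (norms * np.sqrt((yc ** 2).sum()))
        stats[b] = np.sqrt(n) * np.max(np.abs(cors))
    return np.quantile(stats, 1 - alpha)

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
crit_b = calibrate_b(X)  # roughly the 95% point of max of 3 correlated |N(0,1)|'s
print(crit_b)
```

Method (a) is identical except that each simulated response is drawn with replacement from the observed Y_1, ..., Y_n.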

Figure 2 The same graphs as in Figure 1 (ρ = 0.5) of the original article but for globaltest (black circles), method (a) (green crosses), method (b) (red plus signs), and the permutation test (blue triangles). Note model (i) is the null model. (For interpretation of the references to color in this caption and that of Figure 3, the reader is referred to the web version of the article.)
Figure 3 The same graphs as in Figure 2 (ρ = 0.8) of the original article but for globaltest (black circles), method (a) (green crosses), method (b) (red plus signs), and the permutation test (blue triangles).

An alternative approach to calibration is via permutations. Making the dependence of θ̃_n on Y_1, ..., Y_n explicit, we note that under the global null the law of θ̃_n(Y_1, ..., Y_n) is the same as that of θ̃_n(Y_{π(1)}, ..., Y_{π(n)}) for any permutation π of {1, ..., n}. The permutation test has the advantage over (a) and (b) that its size cannot exceed the nominal level regardless of the distribution of Y. Its power also seems close to that of ART.
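A minimal, self-contained sketch of the permutation calibration (our own code and names, not the authors'):

```python
import numpy as np

def max_abs_corr(X, y):
    """theta_tilde: largest absolute sample correlation of a column of X with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cors = Xc.T @ yc / (np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.max(np.abs(cors))

def perm_pvalue(X, y, n_perm=999, rng=None):
    """Under the global null, the responses are exchangeable relative to X,
    so recomputing the statistic on permuted responses yields a p-value
    whose associated test has size at most the nominal level."""
    if rng is None:
        rng = np.random.default_rng(0)
    obs = max_abs_corr(X, y)
    exceed = sum(max_abs_corr(X, rng.permutation(y)) >= obs for _ in range(n_perm))
    return (1 + exceed) / (1 + n_perm)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
y = X[:, 0] + 0.5 * rng.standard_normal(50)  # strong marginal signal
print(perm_pvalue(X, y))
```

The "+1" in numerator and denominator counts the observed statistic among the permutations, which keeps the test valid at finite n_perm.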

Although it may seem natural to base test statistics on θ̃_n, there are other possibilities. For example, Goeman, van de Geer, and van Houwelingen (2006) constructed a locally most powerful test for high-dimensional alternatives to the global null. We compare the power of their globaltest procedure with the approaches discussed above in Figures 2 and 3. Overall, its performance is similar to that of ART, though in certain settings it seems to have a slight advantage and in others a slight disadvantage.

4. EXTENSIONS

In our view, the main attraction of ART is that it can be used to construct confidence intervals for θ_0. It would be interesting to study empirically the coverage properties and lengths of these intervals. Another interesting related question would be to provide some form of uncertainty quantification for the variable having the greatest absolute correlation with the response. The ideas of stability selection (Meinshausen and Bühlmann 2010; Shah and Samworth 2013) provide natural quantifications of variable importance through empirical selection probabilities over subsets of the data. However, it is not immediately clear how to use these to provide, say, a (nontrivial) confidence set of variable indices that, with probability at least 1 − α, contains all indices of variables having the largest absolute correlation with the response (in particular, this would be the full set {1, ..., p} of indices under the global null).

Although understanding marginal relationships between variables and the response is useful in certain contexts, in other situations, the coefficients from multivariate regression are of more interest. It would be interesting to see whether the ART methodology can be extended to provide confidence intervals for the largest regression coefficients in absolute value.

Additional information

Notes on contributors

Rajen D. Shah

Rajen D. Shah (E-mail: [email protected]) and Richard J. Samworth (E-mail: [email protected]), Statistical Laboratory, University of Cambridge, Cambridge CB2 1TN, United Kingdom. The second author is supported by an Engineering and Physical Sciences Research Council Fellowship and a grant from the Leverhulme Trust.


REFERENCES

  • Beran, R.J. (1997), “Diagnosing Bootstrap Success,” Annals of the Institute of Statistical Mathematics, 49, 1–24.
  • Chatterjee, A., and Lahiri, S.N. (2011), “Bootstrapping Lasso Estimators,” Journal of the American Statistical Association, 106, 608–625.
  • Fan, J., and Lv, J. (2008), “Sure Independence Screening for Ultrahigh Dimensional Feature Space” (with discussion), Journal of the Royal Statistical Society, Series B, 70, 849–912.
  • Fan, J., Samworth, R., and Wu, Y. (2009), “Ultrahigh Dimensional Feature Selection: Beyond the Linear Model,” Journal of Machine Learning Research, 10, 2013–2038.
  • Goeman, J.J., van de Geer, S.A., and van Houwelingen, H.C. (2006), “Testing Against a High Dimensional Alternative,” Journal of the Royal Statistical Society, Series B, 68, 477–493.
  • Laber, E., and Murphy, S.A. (2011), “Adaptive Confidence Intervals for the Test Error in Classification” (with discussion), Journal of the American Statistical Association, 106, 904–913.
  • Meinshausen, N., and Bühlmann, P. (2010), “Stability Selection” (with discussion), Journal of the Royal Statistical Society, Series B, 72, 417–473.
  • Samworth, R. (2003), “A Note on Methods of Restoring Consistency to the Bootstrap,” Biometrika, 90, 985–990.
  • ——— (2005), “Small Confidence Sets for the Mean of a Spherically Symmetric Distribution,” Journal of the Royal Statistical Society, Series B, 67, 343–361.
  • Shah, R.D., and Samworth, R.J. (2013), “Variable Selection With Error Control: Another Look at Stability Selection,” Journal of the Royal Statistical Society, Series B, 75, 55–80.