Abstract
Model-X knockoffs is a wrapper that transforms essentially any feature importance measure into a variable selection algorithm, which discovers true effects while rigorously controlling the expected fraction of false positives. A frequently discussed challenge to apply this method is to construct knockoff variables, which are synthetic variables obeying a crucial exchangeability property with the explanatory variables under study. This article introduces techniques for knockoff generation in great generality: we provide a sequential characterization of all possible knockoff distributions, which leads to a Metropolis–Hastings formulation of an exact knockoff sampler. We further show how to use conditional independence structure to speed up computations. Combining these two threads, we introduce an explicit set of sequential algorithms and empirically demonstrate their effectiveness. Our theoretical analysis proves that our algorithms achieve near-optimal computational complexity in certain cases. The techniques we develop are sufficiently rich to enable knockoff sampling in challenging models including cases where the covariates are continuous and heavy-tailed, and follow a graphical model such as the Ising model. Supplementary materials for this article are available online.
Supplementary Materials
Appendix: Appendix of the article. (.pdf file)
Division of Mathematical Sciences;
Acknowledgments
S. B. and E.C. would like to thank Yaniv Romano and Matteo Sesia for useful comments on an early version of this work. L. J. and W. W. would like to thank Jun Liu for fruitful discussions on MCMC.
Notes
1 See Appendix F.6 for a simulation study demonstrating the relationship between power and absolute correlation.
2 In the presence of a response Y, we also require , which is easily satisfied by procedures that generate from X without looking at Y.
3 We use to denote the conditional distribution of W 1 given W 2. We use subscript to mean the vector with the kth coordinate removed, and 1:k to mean the first k coordinates of the vector. We use the subscript 1:0 to mean an empty vector.
4 More generally, we take as acceptance probability with . In this work, γ is set to 1 as default, except in Section 3.3 and Appendix F.2, which are cases where tuning γ is recommended.
5 The proposal distribution at step j only depends on the conditional density of which can be easily shown to be symmetric in the first j – 1 pairs.
6 In this section, when not explicitly specified, a variable is set to its observed value, for example, is shorthand for .
7 This simple fact follows from the running intersection property; we refer the reader to Lemma 1 in Appendix A.
8 A chordal graph is a graph such that any cycle of length 4 or larger has a chord.
9 The hardware varies across simulations, but each CPU is between 2.5 GHz and 3.3 GHz.