Abstract
Consider the online testing of a stream of hypotheses where a real-time decision must be made before the next data point arrives. The error rate is required to be controlled at all decision points. Conventional simultaneous testing rules are no longer applicable due to the more stringent error constraints and the absence of future data. Moreover, the online decision-making process may come to a halt when the total error budget, or alpha-wealth, is exhausted. This work develops a new class of structure-adaptive sequential testing (SAST) rules for online false discovery rate (FDR) control. A key element in our proposal is a new alpha-investing algorithm that precisely characterizes the gains and losses in sequential decision making. SAST captures time-varying structures of the data stream, learns the optimal threshold adaptively in an ongoing manner, and optimizes the alpha-wealth allocation across different time periods. We present theory and numerical results to show that SAST is asymptotically valid for online FDR control and achieves substantial power gain over existing online testing rules.
Supplementary material
The supplementary material contains the proofs of the main theorems, other theoretical results, and additional numerical results.
Notes
1 may be taken either as
on a growing domain or a set of points that lie on a fixed-domain regular grid:
with
.
2 As pointed out by a referee, (2) should be understood as the “average” FDR under the random mixture model (1); the expectation is taken over both
and
. Therefore, the “average” FDR is the correct understanding through which our theory may be properly conceptualized.
3 The asymptotic equivalence can be shown along the same lines as in Basu et al. (2018), who prove the equivalence between the marginal FDR and the FDR. Empirically, the two power measures yield almost identical patterns in our simulations.
4 In situations where the empirical null is more appropriate (Efron 2004), f_0 can first be estimated using the method in Jin and Cai (2007) and then treated as known.
5 In situations where the online FDR analysis must start without prior data, we suggest first applying an existing method such as LOND and then switching to SAST as more data are acquired.
6 Let T be a prespecified large integer denoting the number of tests we tentatively plan to conduct. The bandwidth h_T will be determined by T and fixed throughout the entire period, which is allowed to go beyond T. The choice of h_T is discussed in detail in Section 4.1. We recommend using the same h_T in both Equations (1) and (2) to stabilize the performance.
7 These prespecified regions serve as proxies for the signals of interest that we wish to discover.
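
Note 5 suggests starting with an existing method such as LOND when no prior data are available. For readers unfamiliar with that rule, the following is a minimal sketch of LOND as it is commonly stated in the online testing literature: the i-th hypothesis is tested at level alpha * gamma_i * (D_{i-1} + 1), where D_{i-1} is the number of discoveries made so far and gamma_i is a nonnegative sequence summing to one. The function name `lond`, the choice gamma_i = 6 / (pi^2 i^2), and the example p-values are assumptions made for illustration; this is not the SAST procedure and not code from the paper.

```python
import math


def lond(p_values, alpha=0.05):
    """Sketch of the LOND online testing rule (illustrative, not SAST).

    The i-th test level is alpha * gamma_i * (D_{i-1} + 1), where
    D_{i-1} counts the rejections made before time i.
    """
    decisions = []
    discoveries = 0  # D_{i-1}: number of rejections so far
    for i, p in enumerate(p_values, start=1):
        # Assumed sequence: gamma_i = 6 / (pi^2 * i^2), which sums to 1 over i >= 1.
        gamma_i = 6.0 / (math.pi ** 2 * i ** 2)
        level = alpha * gamma_i * (discoveries + 1)
        reject = p <= level  # the decision uses only past and current information
        decisions.append(reject)
        discoveries += int(reject)
    return decisions


if __name__ == "__main__":
    # Hypothetical p-value stream, for illustration only.
    print(lond([0.001, 0.5, 0.004, 0.9, 0.0001]))
```

Each decision is made before the next p-value is seen, and every discovery raises the levels available to later tests, which is the simplest form of the gain mechanism that alpha-investing rules build on.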