Full article: False Discovery Rate Control For Structured Multiple Testing: Asymmetric Rules And Conformal Q-values

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

Abstract

The effective utilization of structural information in data while ensuring statistical validity poses a significant challenge in false discovery rate (FDR) analyses. Conformal inference provides rigorous theory for grounding complex machine learning methods without relying on strong assumptions or highly idealized models. However, existing conformal methods have limitations in handling structured multiple testing, as their validity often requires the deployment of symmetric decision rules, which assume the exchangeability of data points and permutation-invariance of fitting algorithms. To overcome these limitations, we introduce the pseudo local index of significance (PLIS) procedure, which is capable of accommodating asymmetric rules and requires only pairwise exchangeability between the null conformity scores. We demonstrate that PLIS offers finite-sample guarantees in FDR control and the ability to assign higher weights to relevant data points. Numerical results confirm the effectiveness and robustness of PLIS and demonstrate improvements in power compared to existing model-free methods in various scenarios. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.

Keywords:

1 Introduction

In various applications, data is commonly acquired and organized as ordered sequences or lattices, revealing informative structural patterns that play a crucial role in analysis and interpretation. For example, spatio-temporal data in econometric analyses may display serial or spatial dependence structures, while in genome-wide association studies, single-nucleotide polymorphisms (SNPs) often cluster along biological pathways, indicating functional relationships between genes. To make reliable and meaningful inferences, it is essential to account for the underlying structural patterns in such data. False discovery rate (FDR) methods (Benjamini and Hochberg Citation1995) are a powerful tool for identifying sparse signals from massive and complex data. In FDR analyses, a significant challenge lies in integrating structural information to enhance statistical power and interpretability, while simultaneously ensuring the validity of FDR control. In this section, we discuss recent advancements in addressing this challenge, as well as the limitations of existing approaches, and present our contributions.

1.1 Model-Based and Model-Free FDR Methods

Structural knowledge, such as the clustering patterns and dependence, can be leveraged to enhance the efficiency of existing FDR methods, as demonstrated by the works of Benjamini and Heller (Citation2007), Sun and Cai (2009), Fan, Han, and Gu (Citation2012), Sun et al. (Citation2015), Rebafka, Roquain, and Villers (Citation2022), and Perrot-Dockès et al. (Citation2023). However, the validity of existing “model-based” FDR methods, which involve constructing new test statistics based on estimated model parameters, typically depend on idealized assumptions, such as correct specification of the data generating models, homogeneity and stationarity of the underlying processes, consistent estimation of unknown parameters, and asymptotic normality of the test statistics. These assumptions may not be fulfilled and may be hard to check in practice. The violation of these assumptions can have serious consequences in FDR analysis, including decreased power and inflated error rate. It remains a significant challenge to develop powerful FDR methods for structured multiple testing that are assumption-lean and provably valid.

The framework of conformal inference (Vovk, Gammerman, and Shafer Citation2005) provides finite-sample uncertainty guarantees on a flexible class of off-the-shelf machine learning algorithms, only under the assumption of exchangeability. The important connection between machine learning and FDR analysis, as highlighted by Yang et al. (Citation2021), Mary and Roquain (Citation2022), and Bates et al. (Citation2023), indicates that efficient conformity scores, and hence provably valid and powerful conformal p-values, can be constructed based on complex learning algorithms. This insightful perspective leads to “model-free” approaches to FDR control, a research direction with considerable promise. Recent developments along this direction include the BONuS (Yang et al. Citation2021) and AdaDetect (Marandon et al. Citation2024) procedures, which convincingly demonstrate that combining conformal p-values with the Benjamini-Hochberg (BH) algorithm enables the use of highly effective machine learning models while ensuring finite-sample FDR control, even in the presence of model misspecification. However, currently available conformal methods are not well-equipped to handle structured multiple testing. The limitation is illustrated next.

1.2 Non-Exchangeable Scores and Asymmetric Decision Rules

Suppose we are interested in testing m null hypotheses $H_{0, i}$ , for $1 \leq i \leq m$ , based on summary statistics $X = (X_{i} : 1 \leq i \leq m)$ . Denote $Θ = (θ_{i} : 1 \leq i \leq m) \in {0, 1}^{m}$ the true states of nature, where $θ_{i} = 0 / 1$ indicates that $H_{0, i}$ is true/false. Let $H_{0} = {i \in [m] : H_{0, i} is true}$ . The decisions are represented by a binary vector $δ (X) = (δ_{i} : 1 \leq i \leq m) \in {0, 1}^{m}$ , where $δ_{i} = 1$ indicates that $H_{0, i}$ is rejected and $δ_{i} = 0$ otherwise. We allow $δ_{i}$ to depend on the entire dataset $X$ . We call $δ$ a symmetric decision rule (Copas Citation1974) if $δ {Π (X)} = Π {δ (X)}$ for all permutation operators $Π$ .

To provide context, consider a hidden Markov model (HMM) where the underlying states $Θ$ are unknown, and the observations are conditionally independent given $Θ$ . In this toy example, we assume that the hidden states form a binary Markov chain and the non-null cases tend to appear in clusters. We further assume that the null and non-null distributions are respectively given by $(X_{i} | θ_{i} = 0) \sim N (0, 1)$ and $(X_{i} | θ_{i} = 1) \sim N (2, 1) .$ Suppose we have observed $X_{j} = X_{k} = 2.5$ , $j \neq k$ , where $X_{j}$ is surrounded by small observations and $X_{k}$ is surrounded by large observations. Intuitively, $X_{k}$ is more likely to be a non-null case compared to $X_{j}$ due to the clustering effects from the Markovian structure; hence, symmetric rules are inappropriate. Sun and Cai (2009) demonstrated that the optimal FDR procedure in HMMs is an asymmetric rule that relies on thresholding the local index of significance (LIS), whose value depends on the entire sequence, with higher weights given to neighboring locations.

The presence of structured patterns within the data and the use of asymmetric rules pose a significant challenge to the conformal inference framework. To see this, we tentatively consider the following definition, which generalizes the conformal p-value in Bates et al. (Citation2023) in a nuanced (and possibly improper) way: (1) $\begin{matrix} \begin{matrix} p_{i} \equiv p_{i} (X_{i}) & = \frac{1}{1 + | D^{cal} |} [1 + \sum_{j \in D^{cal}} I {s_{i} (X_{i}) > s_{j} (Y_{j})}], \\ i \in [m] = {1, \dots, m} . \end{matrix} \end{matrix}$ (1)

Here $s_{i} (\cdot)$ is the conformity score function that is allowed to vary across i, $D^{cal}$ is the index set for calibration data containing null samples $Y_{j}$ , with $| \cdot |$ denoting the cardinality of a set. Bates et al. (Citation2023) showed that if the score functions are identical and permutation-invariant, and the data points ${X_{i}, i \in H_{0}; Y_{j}, j \in D^{cal}}$ are jointly exchangeable, then the conformal p-values in (1) are super-uniform and PRDS.Footnote¹ Hence, according to Benjamini and Yekutieli (Citation2001) and Sarkar (Citation2002), applying BH with these conformal p-values is valid for FDR control. However, score functions that leverage structural patterns are typically not permutation-invariant, leading to a violation of the joint exchangeability assumption. Consequently, the conformal p-values defined by (1) may become improper and compromise the PRDS property, making it problematic to apply the BH procedure.

1.3 A Preview of Our Method and Contributions

To develop a provably valid FDR procedure that can effectively leverage structural information and accommodate asymmetric rules, we propose the pseudo local index of significance (PLIS) procedure that consists of three steps:

constructing baseline data that preserve useful structural patterns;
calculating conformity scores based on user-specified working models;
constructing a mirror process that emulates the true process for decision-making.

The proposed algorithm features two critical components: an innovative algorithm for calculating the conformity score through a working model, which captures useful structural patterns and enhances the efficiency of the FDR analysis; a generic framework for constructing a mirror process for decision-making, which bypasses the p-value inference framework, eliminates the need for joint exchangeability assumptions, and establishes finite-sample FDR theory for asymmetric rules.

The proposed research makes several contributions. First, PLIS provides a provably valid model-free approach to multiple testing. Compared to model-based FDR procedures, it achieves comparable power when the underlying model is correctly specified and accurately estimated, while maintaining finite-sample FDR control in scenarios where the model is mis-specified or estimated poorly. Second, PLIS demonstrates superior power compared to existing conformal methods due to its ability to leverage asymmetric rules that exploit structural information. Finally, we have developed a novel theoretical framework that relies only on the pairwise exchangeability between $s_{i} (X_{i})$ and $s_{i} (Y_{i})$ under the null, which is much weaker than the joint exchangeability requirement stipulated in existing theories for conformal inference.

1.4 Organization

The article is organized as follows. Section 2 introduces the PLIS procedure for structured multiple testing and establishes its theoretical properties. Section 3 explores extensions of PLIS and its relationship to existing concepts. Section 4 presents simulation results to assess the numerical performance of PLIS and to compare it with existing methods. An illustration of the proposed method is provided in Section 5 through an application. The supplementary material contains proofs, extensions and additional numerical results.

2 Structured Multiple Testing: Conformal Inference with Asymmetric Rules

In this section, we first present the structured probabilistic model (Section 2.1) and the corresponding problem formulation (Section 2.2). Next, we introduce the PLIS procedure (Section 2.3) and establish its theoretical properties (Section 2.4). Concrete examples and guidelines are provided in Section 2.5 to illustrate the PLIS framework. Sections 2.6 and 2.7, respectively discuss the semi-supervised PLIS algorithm and the conformal q-value notion.

2.1 A Class of Structured Probabilistic Models

Consider a multiple testing problem with binary-valued unknown states $Θ = (θ_{i} : i \in G)$ that form a graph $G$ . Let $m = | G |$ be the number of nodes/hypotheses, where $θ_{i} = 0$ indicates that node i is a null case, and $θ_{i} = 1$ otherwise. Our study focuses on a class of structured probabilistic models (e.g., Goodfellow, Bengio, and Courville Citation2016), where the correlations between random variables $X = (X_{i} : i \in G)$ are captured by the interdependence structure between corresponding latent states $Θ = (θ_{i} : i \in G)$ . The inference units may be conceptualized as the nodes of the graph, and the interdependence structures between test statistics are encoded as edges that connect the nodes. The observations $(X_{i} : i \in G)$ are conditionally independent given $Θ$ , obeying (2) $P (X, Θ) = P (Θ) \prod_{i \in G} f (X_{i} | θ_{i}), X_{i} | θ_{i} \sim (1 - θ_{i}) f_{0} (x) + θ_{i} f_{1 i} (x),$ (2) where $f_{0}$ and $f_{1 i}$ are the null and non-null densities, respectively. Although the model assumes conditional independence, it still remains highly flexible, as we do not impose any distributional constraints on the unknown states $Θ$ , and allow $f_{1 i}$ to vary across different inference units. Conditional independence is closely related to the widely employed exchangeability assumption in the conformal inference literature (Schervish Citation2012; Barber et al. Citation2023). As established by de Finetti’s Theorem (Heath and Sudderth Citation1976; Diaconis and Freedman Citation1980; Durrett Citation2019), if a set of random elements is exchangeable, there must exist a latent $η$ such that the random elements are iid conditional on $η$ . For a more detailed and in-depth exploration of the generalization of model (2) and the exchangeability assumption, please refer to Section 2.6 and Section C of the supplementary material.

We make two assumptions without loss of generality: (a) $f_{0} (- x) = f_{0} (x)$ ; and (b) a larger magnitude of $| X_{i} |$ provides stronger evidence against the null hypothesis. In situations where the assumptions do not hold, we may transform $X_{i}$ into z-values using the formula $z_{i} = Φ^{- 1} {F_{0} (X_{i})}$ , where $F_{0}$ represents the cumulative distribution function (CDF) of $X_{i}$ under the null, and $Φ$ represents the CDF of a standard Gaussian variable.

We discuss several special cases of Model (2). First, we examine the scenario where the latent states $θ_{i}$ are iid and the non-null densities $f_{1 i}$ are identical (and hence denoted as $f_{1}$ ). Under this assumption, Model (2) reduces to Efron’s two-group model (Efron et al. Citation2001): (3) $X_{i} \overset{iid}{\sim} (1 - π) f_{0} (x) + π f_{1} (x),$ (3) where $π = P (θ_{i} = 1)$ is the proportion of non-null cases. While the independent case is not the primary focus of this work, it is worth noting that our proposed methodology remains applicable in this scenario. In Section 2.5, we present a tailored version of our proposal for Model (3), and discuss its connection to the AdaDetect procedure (Marandon et al. Citation2024).

When the latent states $Θ$ form a homogeneous binary Markov chain, a two-state hidden Markov model (HMM, Rabiner Citation1989) can be recovered from (2). HMMs have been extensively studied in the context of multiple testing (e.g., Sun and Cai 2009; Sesia, Sabatti, and Candès Citation2018; Perrot-Dockès et al. Citation2023). By properly accounting for the dependence between test statistics and leveraging the HMM structure, these methods can effectively control the FDR and provide more accurate inference. Section 2.5 presents a customized algorithm for HMMs based on our proposal and illustrates its superiority over existing methods.

The class of models in (2) also includes useful models such as the Ising model (Onsager Citation1944) and conditional random field model (Lafferty, McCallum, and Pereira Citation2001), which have been applied in multiple testing for climate change analysis and network analysis (Liang and Nettleton Citation2010; Shu, Nan, and Koeppe Citation2015; Liu, Zhang, and Page Citation2016; Rebafka, Roquain, and Villers Citation2022). In contrast to HMMs which assume homogeneous transition probabilities and identical emission probabilities, model (2) offers greater flexibility to accommodate deviations from these assumptions. Moreover, in contrast to the stationary hidden states model proposed by Wu (Citation2008), Model (2) allows the underlying process for $Θ$ to be nonstationary. As many real-world phenomena exhibit nonstationary behaviors, Model (2) is better equipped to provide a more accurate representation of the underlying dynamics.

2.2 Problem Formulation and Basic Setup

We consider a multiple testing problem that arises from Model (2), which can be complex in nature and difficult to learn accurately from data. The hypotheses of interest are $H_{0, i} : θ_{i} = 0 versus H_{1, i} : θ_{i} = 1,$ for $i \in G$ . A key feature of the models in (2) is the exchangeability of null observations $(X_{i} : i \in G, θ_{i} = 0)$ , which remains true regardless of the model complexity. This important characteristic provides a strong foundation for conformal inference and has been widely assumed. If we make the additional assumption that our working model has learned a score function that is permutation-invariant, then we can properly define the conformal p-values in (1) for FDR analysis.

As previously indicated, we want to employ a score tailored specifically to index i. The score does not require permutation invariance regarding i. This attribute provides a substantial degree of flexibility, enabling the construction of powerful scores that can effectively leverage useful structural patterns. However, the violation of permutation invariance poses a significant challenge to the conformal inference framework, as the p-value is no longer super-uniform under the null. To address this issue, Section 2.3 develops a new framework that eliminates the need of defining p-values as required by existing methods.

In conformal inference, the calibration dataset, denoted as $Y$ , plays a key role in assessing the relative significance of hypotheses. Various approaches can be employed to obtain the calibration data $Y$ . In the classical setting of multiple testing, $Y$ can be sampled from the known null distribution $F_{0}$ . Conversely, in certain machine learning applications, such as the outlier detection problem, $Y$ may be sampled from a set of labeled null observations. For the present, we assume that the null distribution $F_{0}$ is known. In Section 2.6, we turn to scenarios where $F_{0}$ is unknown, yet labeled null samples are available.

Let $s_{i}^{X} = s_{i} (X_{i})$ denote the test score of unit i. The superscript X is used to distinguish $s_{i}^{X}$ from the calibration score, denoted as $s_{i}^{Y}$ , which are computed based on the calibration data. Let $S_{X} = {s_{i}^{X} : i \in G}$ and $S_{Y} = {s_{i}^{Y} : i \in G}$ . In constructing the scores, it is crucial to guarantee the pairwise exchangeability between $s_{i}^{X}$ and $s_{i}^{Y}$ for $i \in H_{0}$ . This important notion is rigorously defined in Section 2.4.

We denote the decision for test unit i as $δ_{i} \in {0, 1}$ , where $δ_{i} = 1$ signifies the rejection of $H_{0, i}$ and $δ_{i} = 0$ otherwise. Denote $δ = (δ_{i} : 1 \leq i \leq m)$ . The false discovery proportion (FDP) and true discovery proportion (TDP) are respectively defined as (4) $FDP (δ) = \frac{\sum_{j \in G} (1 - θ_{j}) δ_{j}}{(\sum_{j \in G} δ_{j}) \lor 1}, TDP (δ) = \frac{\sum_{j \in G} θ_{j} δ_{j}}{(\sum_{j \in G} θ_{j}) \lor 1} .$ (4)

The (frequentist) FDR and average power (AP) can be defined accordingly: $FDR = E [FDP (δ)] and AP = E [TDP (δ)],$ where the expectation operator is taken over the observed and calibration data while holding the hidden states $Θ$ fixed. The objective is to develop a decision rule that effectively controls the FDR while maximizing the AP.

2.3 The PLIS Procedure for FDR Control

The conformity score that best captures structural patterns from probabilistic models (2) is $P (θ_{i} = 0 | X)$ , assuming perfect knowledge of the model structure. However, accurately estimating this score is challenging. To address this, we propose to compute a score through a user-specified working model that efficiently extracts structural information and predicts unknown states. Specifically, we choose a working model $M$ and corresponding algorithm $A$ to compute pseudo scores $S_{X} = {s_{i}^{X} : i \in G}$ , bearing in mind that the working model may deviate from the complex data-generating model.

To effectively leverage the structural information, the size of the calibration set needs to match that of the test set, ensuring that each test unit is paired with a corresponding calibration data point. Suppose the test and calibration data points have been paired and denoted as ${(X_{i}, Y_{i}), i \in G}$ . In our methodology, the initial step involves constructing the baseline dataset, denoted as $W = (W_{i} \equiv h (X_{i}, Y_{i}) : i \in G)$ , where (5) $\begin{matrix} h (x, y) = {\begin{matrix} x & if | x | \geq | y | \\ y & otherwise \end{matrix} . \end{matrix}$ (5)

The construction of $W$ ensures that $W_{i}$ depends equally on $X_{i}$ and $Y_{i}$ , while satisfying the flipping-coin property, that is, $P (W_{i} = X_{i}) = P (W_{i} = Y_{i}) = 1 / 2$ if $i \in H_{0}$ .

Remark 1

Any symmetric function satisfying $h (x, y) = h (y, x)$ can be employed to generate the baseline data $W$ . However, our choice to use (5) enables us to retain data with large effect sizes, thereby preserving the structural patterns among the non-null cases. This is particularly advantageous for datasets that display local dependency structures. Section E.7 of the supplement shows that (5) leads to substantial power gain compared to alternative choices such as $h^{'} (x, y) = x + y$ .

We proceed to generate two parallel datasets ${{\tilde{X}}_{i} : i \in G}$ and ${{\tilde{Y}}_{i} : i \in G}$ by respectively substituting $X_{i}$ and $Y_{i}$ for $W_{i}$ in $W$ . For example, if $X$ is an ordered sequence $(X_{1}, \dots, X_{m})$ , then ${\tilde{X}}_{i} = (W_{1}, \dots, W_{i - 1}, X_{i}, W_{i + 1}, \dots, W_{m})$ . Next, we calculate the pseudo score $s_{i}^{X} \equiv P_{M, A} (θ_{i} = 0 | {\tilde{X}}_{i})$ , referred to as the pseudo local index of significance (PLIS). The corresponding calibration score is given by $s_{i}^{Y} \equiv P_{M, A} (θ_{i} = 0 | {\tilde{Y}}_{i})$ , applying the same working model $M$ and computational algorithm $A$ to the calibrated data ${\tilde{Y}}_{i}$ . visually illustrates the process of data generation and score computation, which guarantees the pairwise exchangeability between $s_{i}^{X}$ and $s_{i}^{Y}$ , a key property that will be rigorously defined and established in Section 2.4.

Figure 1: A graphical illustration of constructing the baseline data and calculating the scores. To create ${\tilde{X}}_{i}$ and ${\tilde{Y}}_{i}$ , each $W_{i}$ in $W$ is replaced by $X_{i}$ and $Y_{i}$ in turn, for $i \in G$ , with the values in remaining nodes unchanged. The same function $s_{k} (\cdot)$ is used for computing $s_{k}^{X}$ and $s_{k}^{Y}$ , ensuring that the two scores are pairwise exchangeable if $k \in H_{0}$ .

Figure 1: A graphical illustration of constructing the baseline data and calculating the scores. To create X˜i and Y˜i, each Wi in W is replaced by Xi and Yi in turn, for i∈G, with the values in remaining nodes unchanged. The same function sk(·) is used for computing skX and skY, ensuring that the two scores are pairwise exchangeable if k∈H0.

Suppose that a smaller value of $s_{i}^{X}$ indicates stronger evidence against the null hypothesis. Intuitively, the rejection of $H_{0, i}$ requires two pieces of evidence: first, $s_{i}^{X}$ must be sufficiently small, and second, $s_{i}^{X}$ should be smaller than its null counterpart $s_{i}^{Y}$ . Therefore, we only select small $s_{i}^{X}$ from the candidate rejection set $G^{r} = {i \in G : s_{i}^{X} < s_{i}^{Y}}$ . Consider a class of thresholding rules $δ = (δ_{i} : i \in G)$ , where $δ_{i} \equiv 0$ for $i \notin G^{r}$ , and $δ_{i} = I (s_{i}^{X} \leq t)$ for $i \in G^{r}$ . Correspondingly, we introduce the calibration set $G^{c} = {i \in G : s_{i}^{Y} < s_{i}^{X}}$ , which mirrors $G^{r}$ . This mirror set of null scores ${s_{i}^{Y} : i \in G^{c}}$ allows us to characterize the joint behavior of their counterparts ${s_{i}^{X} : i \in G^{r}}$ under the null. It follows that the true FDP process and its mirror process, denoted $Q (t)$ , are respectively given by (6) $\begin{matrix} FDP (t) & = \frac{\sum_{j \in G^{r} \cap H_{0}} I {s_{j}^{X} \leq t}}{(\sum_{j \in G^{r}} I {s_{j}^{X} \leq t}) \lor 1}, \\ Q (t) & = \frac{1 + \sum_{j \in G^{c}} I {s_{j}^{Y} \leq t}}{(\sum_{j \in G^{r}} I {s_{j}^{X} \leq t}) \lor 1}, t > 0. \end{matrix}$ (6)

To gain a better understanding of our methodology, it is helpful to note that $Q (t)$ provides a conservative estimate of $FDP (t)$ . This conservativeness arises from two factors: (a) $\sum_{j \in G^{c}} I {s_{j}^{Y} \leq t}$ is always greater than or equal to $\sum_{j \in G^{c} \cap H_{0}} I {s_{j}^{Y} \leq t}$ ; and (b) $\sum_{j \in G^{c} \cap H_{0}} I {s_{j}^{Y} \leq t}$ can be seen as a mirror process that resembles the unobserved process $\sum_{j \in G^{r} \cap H_{0}} I {s_{j}^{X} \leq t}$ since, as will be established in the next section (Property 3), $s_{j}^{X}$ and $s_{j}^{Y}$ are pairwise exchangeable for $j \in H_{0}$ . We choose the largest threshold, denoted as $τ$ , such that the conservative estimate of the FDP falls below $α$ : $τ = sup {t \in S_{X} \cup S_{Y} : Q (t) \leq α} .$

The proposed PLIS procedure, summarized in Algorithm 1, rejects $H_{0, i}$ if $δ_{i} \equiv I (i \in G^{r}, s_{i}^{X} \leq τ) = 1$ . By mathematical conventions, the supremum of an empty set is defined as $- \infty$ . Therefore, if the set ${t \in S_{X} \cup S_{Y} : Q (t) \leq α}$ is empty, then $τ$ is set to $- \infty$ , and Algorithm 1 does not reject any hypotheses.

Algorithm 1

The PLIS procedure

Input: The observations $X = (X_{i} : i \in G)$ , the null distribution $F_{0}$ , a pre-specified FDR level $α$ , a working model $M$ and a computational algorithm $A$ .

Output: A decision rule $δ = (δ_{i} : i \in G)$ .

1: Generate $Y = (Y_{i} : i \in G)$ , where $Y_{i}$ are iid observations from $F_{0}$ .

2: Create $W = (W_{i} : i \in G)$ , where $W_{i} = h (X_{i}, Y_{i})$ and $h (x, y)$ is given by (5). Construct ${\tilde{X}}_{i}$ and ${\tilde{Y}}_{i}$ by respectively substituting $X_{i}$ and $Y_{i}$ in place of $W_{i}$ in $W$ for $i \in G$ .

3: Calculate $s_{i}^{X} \equiv P_{M, A} (θ_{i} = 0 | {\tilde{X}}_{i})$ and $s_{i}^{Y} \equiv P_{M, A} (θ_{i} = 0 | {\tilde{Y}}_{i})$ for $i \in G$ with the user-specified working model $M$ and algorithm $A$ . Let $G^{r} = {i : s_{i}^{X} < s_{i}^{Y}}$ .

4: Let $τ = sup {t \in S_{X} \cup S_{Y} : Q (t) \leq α}$ , where $Q (t)$ is defined in (2.3).

5: Let $δ_{i} = I {s_{i}^{X} \leq τ}$ for $i \in G^{r}$ , otherwise $δ_{i} = 0$ .

6: Return $δ = (δ_{i} : i \in G)$ .

We conclude this section with several remarks. First, in order to ensure pairwise exchangeability between $s_{i}^{X}$ and $s_{i}^{Y}$ under the null, it is crucial to estimate the working model $M$ from $W$ before calculating the conformity scores. Detailed explanations and practical guidelines for the deployment of PLIS under two widely-used working models can be found in Section 2.5. Second, we can define an alternative estimate of $FDP (t)$ , denoted as $Q^{'} (t)$ , by substituting $G$ in place of both $G^{r}$ and $G^{c}$ in (2.3). However, employing $Q^{'} (t)$ tends to be overly conservative; further elaborations on this are provided in Sections 3.3. Third, Algorithm 1 closely resembles the operation of the conformal BH, albeit with several modifications. Finally, as revealed by an insightful reviewer, $Q (t)$ in (2.3) is closely related to the Selective SeqStep + algorithm (Barber and Candès Citation2015). Further discussions on these connections can be found in Section D of the supplement.

2.4 Theory

We start by establishing several exchangeability properties, first for the data points and subsequently for the conformity scores. The random elements ${Z_{i} : i \in [K]}$ are said to be (jointly) exchangeable if the distribution of $(Z_{1}, \dots, Z_{K})$ is the same as that of $(Z_{Π_{1}}, \dots, Z_{Π_{K}})$ for any permutation $(Π_{1}, \dots, Π_{K})$ of the indices ${1, \dots, K}$ . We write $(Z_{1}, \dots, Z_{K}) \overset{d}{=} (Z_{Π_{1}}, \dots, Z_{Π_{K}}),$ where $\overset{d}{=}$ denotes equality in distribution. We will soon introduce a weaker notion, the pairwise exchangeability, which is more relevant to our theory. The following exchangeability properties are proven in Section A of the supplement.

The first property can be easily deduced from the definition of Model (2) and the construction of $W$ outlined in Algorithm 1.

Property 1 (Exchangeability and conditional independence between data points)

Suppose $X = (X_{i} : i \in G)$ are observations from Model (2), and $(Y_{i} : i \in G)$ are randomly drawn from the null distribution $F_{0}$ . Then we have: (a) The random variables ${Y_{i}, i \in G; X_{j}, j \in H_{0}}$ are jointly exchangeable conditional on $Θ$ ; (b) ${Z_{j} = {X_{j}, Y_{j}} : j \in G}$ are conditionally independent given $W$ and $Θ$ ; (c) For $i \in H_{0}$ , $X_{i}$ and $Y_{i}$ are pairwise exchangeable conditional on $W$ and $Θ$ , that is, $(X_{i}, Y_{i} | W, Θ) \overset{d}{=} (Y_{i}, X_{i} | W, Θ) .$

The next property characterizes the pairwise exchangeability between ${\tilde{X}}_{j}$ and ${\tilde{Y}}_{j}$ .

Property 2 (Pairwise exchangeability and conditional independence between datasets)

Consider ${\tilde{X}}_{j}$ and ${\tilde{Y}}_{j}$ as generated according to Algorithm 1. Then we have: (a) ${D_{j} = {{\tilde{X}}_{j}, {\tilde{Y}}_{j}} : j \in G}$ are conditionally independent given $W$ and $Θ$ ; (b) For $i \in H_{0}$ , ${\tilde{X}}_{i}$ and ${\tilde{Y}}_{i}$ are pairwise exchangeable, that is $({\tilde{X}}_{i}, {\tilde{Y}}_{i} | W, Θ) \overset{d}{=} ({\tilde{Y}}_{i}, {\tilde{X}}_{i} | W, Θ)$ .

The next property is concerned with the conditional independence and pairwise exchangeability between null scores.

Property 3 (Pairwise exchangeability and conditional independence between scores)

Consider scores $s_{j}^{X}$ and $s_{j}^{Y}$ computed according to Algorithm 1, denote $s_{j} = {s_{j}^{X}, s_{j}^{Y}}$ for $j \in G$ . Then we have: (a) ${s_{j} : j \in G}$ are conditionally independent given $W$ and $Θ$ ; (b) Denote $S_{- i} = \cup_{j \in G, j \neq i} s_{j}$ . For $i \in H_{0}$ , $s_{i}^{X}$ and $s_{i}^{Y}$ are pairwise exchangeable given $S_{- i}$ and $Θ$ : (7) $\begin{matrix} (s_{i}^{X}, s_{i}^{Y} | S_{- i}, Θ) \overset{d}{=} (s_{i}^{Y}, s_{i}^{X} | S_{- i}, Θ), or equivalently \\ (s_{i}^{X}, s_{i}^{Y}, S_{- i} | Θ) \overset{d}{=} (s_{i}^{Y}, s_{i}^{X}, S_{- i} | Θ) . \end{matrix}$ (7)

The pairwise exchangeability (7), which is crucial in proving the FDR theory for PLIS, is a weaker assumption than the joint exchangeability among all null scores ${s_{i}^{X} : i \in H_{0}} \cup {s_{i}^{Y} : i \in G}$ required in most existing conformal methods. Now we state our main theorem that establishes the validity of PLIS for controlling the FDR.

Theorem 1

. Consider Model (2) and the PLIS procedure outlined in Algorithm 1. Then Properties 1–3 hold and PLIS controls the FDR at level $α$ .

Our approach sets itself apart from current theories in conformal inference that rely on joint exchangeability (Yang et al. Citation2021; Mary and Roquain Citation2022; Bates et al. Citation2023; Marandon et al. Citation2024). In contrast, we introduce a framework that uses only pairwise exchangeability and establishes our theory grounded in the Selective SeqStep + algorithm (Barber and Candès Citation2015).

2.5 Practical Guidelines and Examples

PLIS uses resampling techniques to generate a mirror process that quantifies the FDP in multiple testing. At its core, PLIS can be situated within the conformal inference framework, which facilitates the development of valid FDR rules that exhibit enhanced robustness to model misspecification. This section presents examples that contextualize PLIS as a valuable tool for “conformalizing” some well-known FDR procedures, producing inferences that possess similar operational characteristics as state-of-the-art conformal methods (Yang et al. Citation2021; Bates et al. Citation2023; Marandon et al. Citation2024; Liang, Sesia, and Sun Citation2024).

Example 1

(Conformalizing the LIS procedure under HMMs)

A variety of real-world applications involve data with HMM-type structures. Several works (Sun and Cai 2009; Perrot-Dockès et al. Citation2023) have demonstrated that exploiting the HMM structure can greatly enhance the power of FDR analysis. However, their theoretical analysis necessitates certain conditions to be satisfied, such as homogeneous transition probabilities and identical non-null distributions, which may not be strictly met in practical scenarios. The HMMs are characterized by a set of parameters, denoted as $ϑ = (A, f_{0}, f_{1})$ , where $A$ is the transition probability matrix, and $f_{0}$ and $f_{1}$ are the emission densities for the null and non-null observations, respectively.

In our setup, we use $M_{HM}$ as the HMM working model, where $ϑ$ is estimated using the EM algorithm on $W$ . To implement the PLIS procedure, we use the forward-backward algorithm, denoted by $A_{FB}$ , to compute the scores $s_{i}^{X} = P_{M_{HM}, A_{FB}} (θ_{i} = 0 | {\tilde{X}}_{i})$ and $s_{i}^{Y} = P_{M_{HM}, A_{FB}} (θ_{i} = 0 | {\tilde{Y}}_{i})$ for $i \in G$ . The decision-making process is based on Algorithm 1. The PLIS procedure can be regarded as a conformalized adaptation of the LIS procedure proposed by Sun and Cai (2009), computed using $M_{HM}$ and algorithm $A_{FB}$ . PLIS differs from LIS in that it can handle a variety of data structures that deviate from HMMs. This ability ensures valid FDR control in finite-samples, even when the working model is misspecified or the model parameters are estimated poorly.

Example 2

(Conformalizing the adaptive z-value procedure under the two-group model (3) with independent observations). In the context where observations are assumed to be independent and Efron’s two-group model (3), denoted $M_{TG}$ , is adopted, Sun and Cai (Citation2007) showed that the density ratio (DR) $r (x) = \frac{f_{0} (x)}{f (x)}$ or the local false discovery rate $Lfdr (x) = (1 - π) r (x)$ serves as the optimal building block for FDR analysis. The PLIS framework treats DR and Lfdr equivalently since $(1 - π)$ is a constant across all study units and only the relative ranking contributes to the operation of PLIS. Using the working model $M_{TG}$ , the conformity score $s_{i} (\cdot)$ can be taken as the DR function $\hat{r} (\cdot) = f_{0} (\cdot) / \hat{f} (\cdot)$ , where $f_{0}$ is known, and $\hat{f} (\cdot)$ is the standard kernel density estimator constructed based on $W$ . The PLIS procedure operates by employing $s_{i}^{X} \equiv \hat{r} (X_{i})$ and $s_{i}^{Y} \equiv \hat{r} (Y_{i})$ in Algorithm 1. As carefully explained in Section D.7 of the supplement, although DR is permutation-invariant with respect to $W$ , the null scores are not jointly exchangeable. Nevertheless, PLIS still works since $(s_{i}^{X}, s_{i}^{Y})$ are pairwise exchangeable given $W$ and $Θ$ .

Sun and Cai (Citation2007) proposed a class of adaptive z-value (AZ) procedures, and showed that AZ outperforms p-value based methods by adapting to the shape of the alternative distribution. PLIS may be viewed as a conformalized adaptation of AZ, which effectively incorporates the structural information into inference while controlling the FDR in finite samples. This represents a notable advantage over the AZ procedure, which only ensures asymptotic FDR control. We note that the AdaDetect method (Marandon et al. Citation2024), which also uses the two-group model to estimate a permutation-invariant score function, exhibits similar advantages to PLIS. However, when encountering score functions that lack permutation-invariance, particularly when using the HMM as the working model, PLIS significantly deviates from AdaDetect.

Our customized PLIS procedures remain valid as long as the underlying model belongs to the broad class (2), regardless of the working models or algorithms used. Developing conformity scores using sensible models and computational algorithms is essential for improving the practicality and effectiveness of PLIS. Finally, instances of numerical issues such as significant reductions in power and convergence failures frequently serve as warnings for practitioners, highlighting the unsuitability of the current model for the specific application and emphasizing the need for a new working model.

2.6 Semi-Supervised PLIS

This section considers the scenario in which the null distribution is unknown, and users only have access to n labeled null data points $U = {U_{1}, \dots, U_{n}}$ drawn from that distribution. The objective is to assess whether new data points ${X_{i} : i \in G}$ follow the same unknown distribution as the reference dataset $U$ . This scenario is known as the semi-supervised multiple testing problem (Mary and Roquain Citation2022), or the outlier detection problem in the conformal inference literature (Bates et al. Citation2023; Marandon et al. Citation2024). We introduce a semi-supervised PLIS (Algorithm 2) to solve this problem.

The algorithm first partitions the labeled null data $U$ into two subsets: the calibration set $Y = (Y_{i} : i \in G)$ and the training set $U^{tr} = U ∖ Y$ . The test data $X$ and calibration data $Y$ are then used to construct the baseline data $W$ by following the same steps in Algorithm 1. We move on to the estimation issue. In the case of an HMM, an estimator for the emission distribution $f_{0}$ can be learned from the labeled null data $U^{tr}$ . Subsequently, we use the baseline data $W$ to estimate $f_{1}$ and transition probabilities via the EM algorithm, and then calculate the conformity scores via the forward-backward procedure. If the working model is a two-group model, the conformity scores correspond to the density ratios, which can be estimated using both $U^{tr}$ and $W$ via the positive unlabeled (PU) learning algorithms. This approach has also been suggested by Marandon et al. (Citation2024). The above steps are summarized in Algorithm 2.

Algorithm 2

The semi-supervised PLIS procedure

Input: The test data $X = (X_{i} : i \in G)$ , the null samples $U = {U_{1}, \dots, U_{n}}$ , a pre-specified FDR level $α$ , a working model $M$ and a computational algorithm $A$ .

Output: A decision rule $δ = (δ_{i} : i \in G)$ .

1: Split $U$ into the calibration data $Y = (Y_{i} : i \in G)$ and the training data $U^{tr} = U ∖ Y$ .

4: Let $τ = sup {t \in S_{X} \cup S_{Y} : Q (t) \leq α}$ , where $Q (t)$ is defined in (2.3).

5: Let $δ_{i} = I {s_{i}^{X} \leq τ}$ for $i \in G^{r} = {j \in G : s_{j}^{X} < s_{j}^{Y}}$ , otherwise $δ_{i} = 0$ .

6: Return $δ = (δ_{i} : i \in G)$ .

Remark 2

Here we have assigned exactly one null data point to each test unit, but in situations where there is an abundance of null samples, it may be appropriate to employ the de-randomization idea in Section 3.2 to improve stability. In situations where the size of the null dataset is smaller than m, one may conduct an independent screening procedure to narrow down the focus before using PLIS on a smaller subset of hypotheses. These represent interesting issues for future investigation.

The next theorem, established using de Finetti’s theorem (see Section C in the supplement) and techniques employed in proving Properties 1–3, generalizes Theorem 1 in two ways. First, it considers the semi-supervised setup, where we have access to labeled null data instead of explicit knowledge of the null distribution. Second, it relaxes the assumption of conditional independence given $Θ$ in Model (2) by allowing for (exchangeable) correlated noise in the data generation process.

Theorem 2

Consider Algorithm 2, if the null data ${U_{1}, \dots, U_{n}, X_{i}, i \in H_{0}}$ are exchangeable conditional on the non-null data ${X_{i} : i \notin H_{0}}$ and the true states, then (a) for $i \in H_{0}$ , $s_{i}^{X}$ and $s_{i}^{Y}$ are pairwise exchangeable, satisfying (7); (b) Algorithm 2 controls FDR at $α$ .

In Section E.4 of the supplement, we provide numerical results to corroborate Theorem 2 by considering the semi-supervised setup in which correlated noises are introduced to the observations generated from Model (2).

2.7 Conformal q-values

When the joint exchangeability condition fails to hold, conformal p-values can no longer be properly defined. To address this issue, we introduce the concept of conformal q-value as a significance index to measure the risk associated with individual decisions.

We begin by examining $Q (t)$ in (2.3), which offers a conservative estimate of the FDP. Consider the minimum FDR level $α$ at which $H_{0, i}$ can be “just” rejected: (8) $q_{i} \equiv min_{t \in S_{X} \cup S_{Y}, t \geq s_{i}^{X}} Q (t), for i \in G^{r}; q_{i} \equiv 1, for i \in G ∖ G^{r} .$ (8)

The adjusted $q_{i}$ is referred to as the conformal q-value corresponding to $H_{0, i}$ , owing to its resemblance to the q-value idea introduced by Storey (Citation2003). While Storey’s q-value is constructed based on the empirical distribution of p-values, our conformal q-value is derived from a resampling method and a carefully designed mirror process.

The conformal q-value is a valid and user-friendly significance index that provides clear interpretability for individual decisions, and practitioners can directly use it for decision-making by comparing them with a pre-specified $α$ . Theorem 1 and Proposition 1 below together establish the validity of using the conformal q-value (8) in FDR analysis.

Proposition 1

Consider the PLIS procedure as outlined in Algorithm 1 and the conformal q-value defined by (8). Then the decision $δ_{i}$ presented in Algorithm 1 is equivalent to rejecting $H_{0, i}$ if $q_{i} \leq α$ .

3 Connections to Existing Works and Extensions

In this section, we first establish the connection between PLIS and the e-BH method (Section 3.1), then introduce several ensuing extensions: derandomized PLIS (Section 3.2), as well as ${PLIS}_{cbh}$ and ${PLIS}_{sym}$ (Sections 3.3). Finally, we discuss the distinctions of PLIS from related works (Section 3.4).

3.1 Connection to the e-BH Procedure

In hypothesis testing, an e-value (Vovk and Wang Citation2021) is defined as the observed value of a nonnegative random variable E that satisfies the condition $E [E] \leq 1$ under the null hypothesis. E-values can be constructed using betting scores (Shafer Citation2021), likelihood ratios, and stopped super-martingales (Grünwald, de Heide, and Koolen Citation2020). This section demonstrates how the PLIS framework can be used to construct robust and powerful e-values.

Let $(e_{i} : 1 \leq i \leq m)$ be the e-values for testing $(H_{0, i} : 1 \leq i \leq m)$ . Wang and Ramdas (Citation2022) proposed the e-BH procedure, which involves first ordering the e-values as $e_{(1)} \geq e_{(2)} \geq \dots \geq e_{(m)}$ , and then choosing a cutoff along the ranking using the following step-wise algorithm. Let $\hat{k} = max {i : (i / m) e_{(i)} \geq (1 / α)}$ , then reject hypotheses in the set $R_{ebh} = {1 \leq j \leq m : e_{j} \geq e_{(\hat{k})}}$ . We call ${e_{j} : 1 \leq j \leq m}$ a set of generalized e-values if (9) $E (\sum_{j \in H_{0}} e_{j}) \leq m .$ (9)

Condition (9) is strictly weaker than the condition that $E [e_{j}] \leq 1$ for all $j \in H_{0}$ . Wang and Ramdas (Citation2022) proved that if ${e_{j} : 1 \leq j \leq m}$ are a set of generalized e-values, then the e-BH procedure controls FDR at $α$ under arbitrary dependence. We can construct a set of generalized e-values based on the PLIS framework. Specifically, define (10) $e_{j} = \frac{m δ_{j}}{1 + \sum_{i \in G^{c}} I {s_{i}^{Y} \leq τ}}, for j \in G,$ (10) where $δ_{j}$ , $G^{c}$ and $τ$ are defined by Algorithm 1. The following theorem can be proved using similar techniques for proving Theorem 1.

Theorem 3

The variables $e_{j}$ defined in (10) constitute a set of generalized e-values.

The following proposition demonstrates that the PLIS procedure is equivalent to implementing the e-BH procedure with generalized e-values as defined by (10). The connection to e-BH offers valuable insights into our theory, providing a new perspective on why PLIS performs robustly in the face of general dependence and model misspecification.

Proposition 2

If we implement the e-BH procedure with e-values in (10), then the rejection set $R_{ebh} = R$ , where $R = {i \in G : δ_{i} = 1}$ is the set of rejected hypotheses by Algorithm 1.

3.2 Derandomized PLIS

While the PLIS framework employs resampling methods, which can introduce additional uncertainties and may be viewed as undesirable by practitioners, the fact that the average of e-values is still an e-value (Vovk and Wang Citation2021) offers motivation to consider a principled derandomization method. This involves aggregating multiple random samples to reduce variability. Similar ideas have been used in contemporaneous works, including Ren and Barber (Citation2023) and Bashari et al. (Citation2023).

The derandomized PLIS procedure involves running the PLIS procedure repeatedly for N times and averaging the results for decision-making. Specifically, for the kth replication, $k \in [N]$ , we generate ${Y_{j}^{(k)} : j \in G} \overset{iid}{\sim} f_{0}$ , and construct ${\tilde{X}}_{i}^{(k)}$ and ${\tilde{Y}}_{i}^{(k)}$ by respectively substituting $X_{i}^{(k)}$ and $Y_{i}^{(k)}$ in place of $W_{i}^{(k)}$ in $W^{(k)}$ , where $W^{(k)}$ is created by combining $X$ and $Y^{(k)} = (Y_{i}^{(k)} : i \in G)$ via (5). We then compute the scores ${s_{i}^{X (k)}, s_{i}^{Y (k)} : i \in G}$ and threshold $τ^{(k)}$ based on Algorithm 1. The corresponding e-values ${e_{i}^{(k)} : i \in G}$ for replication k can be calculated using (10). This process is repeated for $k = 1, \dots, N$ . Finally, we average the individual e-values $e_{i}^{(k)}$ to compute the summary e-values ${\bar{e}}_{i} = \frac{1}{N} \sum_{k = 1}^{N} e_{i}^{(k)}$ for all $i \in G$ , and apply the e-BH procedure with ${{\bar{e}}_{i} : i \in G}$ . The proposed derandomized PLIS procedure is summarized in Algorithm 3. The theoretical FDR control of derandomized PLIS follows from the validity of the e-BH procedure and Theorem 3.

Remark 3

Algorithm 3 has allowed different replications to have varied ${(α_{k})}_{k = 1}^{N}$ . As noted by Ren and Barber (Citation2023), derandomization could lead to higher power if the e-values are carefully aggregated through powerful weighting schemes. We have investigated this issue with some preliminary results provided in Section E.6 in the supplement. The development of optimal weights is an interesting direction for future research.

Algorithm 3

The Derandomized PLIS procedure

Input: Observations $X = (X_{i} : i \in G)$ , the null distribution $F_{0}$ , working model $M$ , algorithm $A$ , number of replications N, weights ${(α_{k})}_{k = 1}^{N}$ , nominal FDR level $α$ .

Output: Decisions $δ = (δ_{i} : i \in G)$ .

1: for all $k = 1, 2, \dots, N$ do

2: Generate ${Y_{j}^{(k)}}_{j = 1}^{m} \overset{iid}{\sim} F_{0}$ . Denote $Y^{(k)} = (Y_{j}^{(k)} : j \in G)$ .

3: Create $W^{(k)} = (W_{i}^{(k)} : i \in G)$ by combining $X$ and $Y^{(k)}$ via (5). Construct ${\tilde{X}}_{i}^{(k)}$ and ${\tilde{Y}}_{i}^{(k)}$ by respectively substituting $X_{i}$ and $Y_{i}^{(k)}$ in place of $W_{i}^{(k)}$ in $W^{(k)}$ for $i \in G$ .

4: Calculate $s_{i}^{X (k)} \equiv P_{M, A} (θ_{i} = 0 | {\tilde{X}}_{i}^{(k)})$ and $s_{i}^{Y (k)} \equiv P_{M, A} (θ_{i} = 0 | {\tilde{Y}}_{i}^{(k)})$ for $i \in G$ with user-specified working model $M$ and algorithm $A$ . Let $G^{r (k)} = {i \in G : s_{i}^{X (k)} < s_{i}^{Y (k)}}$ , $G^{c (k)} = {i \in G : s_{i}^{Y (k)} < s_{i}^{X (k)}}$ , and $S_{X}^{(k)} = {s_{i}^{X (k)}}_{i \in G}$ , $S_{Y}^{(k)} = {s_{i}^{Y (k)}}_{i \in G}$ .

5: Let $Q^{(k)} (t) = \frac{1 + \sum_{i \in G^{c (k)}} I {s_{i}^{Y (k)} \leq t}}{1 \lor (\sum_{i \in G^{r (k)}} I {s_{i}^{X (k)} \leq t})}$ and $τ^{(k)} = sup {t \in S_{X}^{(k)} \cup S_{Y}^{(k)} : Q^{(k)} (t) \leq α_{k}}$ .

6: Calculate $e_{i}^{(k)} = \frac{m I {s_{i}^{X (k)} \leq τ^{(k)}}}{1 + \sum_{j \in G^{c (k)}} I {s_{j}^{Y (k)} \leq τ^{(k)}}}$ for $i \in G^{r (k)}$ , otherwise $e_{i}^{(k)} = 0$ .

7: end for

8: Let ${\bar{e}}_{i} = \frac{1}{N} \sum_{k = 1}^{N} e_{i}^{(k)}$ for $i \in G$ . Denote the ordered statistics by ${\bar{e}}_{(1)} \geq {\bar{e}}_{(2)} \geq \dots \geq {\bar{e}}_{(m)}$ . Let $\hat{k} = max {i : (i {\bar{e}}_{(i)} / m) \geq (1 / α)}$ .

9: Let $δ_{i} = I {{\bar{e}}_{i} \geq {\bar{e}}_{(\hat{k})}}$ for $i \in G$ .

Return $δ = (δ_{i} : i \in G)$

3.3 Comparison with Conformal and Knockoff Methods

PLIS is developed based on the principles of conformal inference, which involve the utilization of machine learning algorithms to calculate conformity scores and the use of calibration data to quantify uncertainties in decision-making. Inspired by an insightful comment from a reviewer, Section D.4 of the supplement shows that the mirror process (2.3), originally adapted from the conformal BH algorithm (see Section D.3), falls into the class of Selective SeqStep + algorithms (Barber and Candès Citation2015). Our innovative construction of baseline data $W$ , development of pairwise exchangeable scores, and utilization of the new mirror process offer several enhancements to existing conformal methods, enabling the extension of the knockoff framework beyond regression settings (Barber and Candès Citation2015; Candès et al. Citation2018; Ren and Candès Citation2023).

This section explores two variations of PLIS, emphasizing its connection to existing works and providing insights into its advantages.

The first variation replaces both $G^{r}$ and $G^{c}$ with $G$ in Algorithm 1. This variation, denoted as ${PLIS}_{cbh}$ , essentially involves applying the conformal BH algorithm (Mary and Roquain Citation2022; Bates et al. Citation2023) to the conformity scores ${(s_{i}^{X}, s_{i}^{Y}) : i \in G}$ generated by Algorithm 1. However, implementing the conformal BH algorithm with pairwise exchangeable scores may result in overly conservative FDR levels. Additionally, using the adaptive BH algorithm (Bates et al. Citation2023; Marandon et al. Citation2024) to overcome this conservativeness is inappropriate, as conformal p-values computed from pairwise exchangeable scores do not satisfy the super-uniformity and PRDS conditions.

The second variation involves applying symmetrization techniques, which have shown success in regression contexts (Barber and Candès Citation2015; Candès et al. Citation2018; Ren and Candès Citation2023; Du et al. Citation2023). Its key challenge lies in identifying efffective anti-symmetric functions that fulfill $T_{j} (a, b) = - T_{j} (b, a)$ to transform pairwise exchangeable scores into symmetrized test statistics. In Section D.4 of the supplement, we demonstrate that the mirror process (2.3) corresponds to using the Selective SeqStep + algorithm proposed by Barber and Candès (Citation2015) with a carefully designed anti-symmetric function. To underscore the significance of the symmetrization process, we consider a popular choice of anti-symmetric function $T_{j} \equiv s_{j}^{Y} - s_{j}^{X}$ , as suggested in Candès et al. (Citation2018) and their later works. This approach, labeled as ${PLIS}_{sym}$ , rejects $H_{0, j}$ if $T_{j} = s_{j}^{Y} - s_{j}^{X} \geq τ_{sym}$ , where $τ_{sym}$ is determined by the following FDP process: $τ_{sym} = inf {t \in {| T_{j} | : j \in G} : \frac{1 + \sum_{j \in G} I {T_{j} \leq - t}}{\sum_{j \in G} I {T_{j} \geq t}} \leq α} .$ However, using $T_{j} = s_{j}^{Y} - s_{j}^{X}$ may lead to information loss because the contrast has a tendency to cancel out the inherent clustering structure among non-null effects.

Section E.8 presents numerical results that illustrate the advantages of PLIS over its two variations ${PLIS}_{cbh}$ and ${PLIS}_{sym}$ .

3.4 Comparison with Covariate-Adaptive Methods

Structured multiple testing is a subject of extensive research that has attracted significant interest. Our PLIS procedure distinguishes itself from existing covariate-adaptive methods in several aspects. First, in the works of Lei and Fithian (Citation2018), Ignatiadis and Huber (Citation2021) and Ren and Candès (Citation2023), the structural information is represented by a covariate sequence and is incorporated through techniques such as weighted p-values or covariate-adaptive thresholds. By contrast, our approach leverages structural knowledge about the underlying data generation process and integrates it into inference via a user-specified working model. Second, the theory in existing methods is developed based on group-wise cross-weighting (Ignatiadis and Huber Citation2021) or covariate modulation through data masking (Lei and Fithian Citation2018; Ren and Candès Citation2023). In contrast, PLIS operates within the theoretical framework of conformal inference, harnessing the flexibility of machine learning algorithms to learn powerful conformity scores, with associated decision risk being assessed by employing carefully calibrated null scores. Thus, PLIS not only facilitates the effective incorporation of structural information but also guarantees the validity of inference, even in scenarios where the machine learning models are mis-specified. Finally, PLIS offers an intuitive, fast, and user-friendly computational algorithm that efficiently computes scores from the baseline data. This differs from the more computationally intensive algorithms proposed in Lei and Fithian (Citation2018) and Ren and Candès (Citation2023), which involve intricate techniques such as adaptive unmasking and boundary updating.

4 Simulation

This section presents simulation results that compare PLIS, which is implemented as described in the examples of Section 2.5, with several competing methods. For all of the numerical experiments, we set $m = 2000$ and $α = 0.05$ . The observations are generated following this model: $X_{i} | θ_{i} \overset{ind .}{\sim} (1 - θ_{i}) N (0, 1) + θ_{i} N (μ, 1)$ . The latent states $Θ$ obey various dependence structures in different experiments. The following methods are considered in our analysis: (a) ${PLIS}_{HM},$ Footnote² implemented via Algorithm 1, using the working model $M_{HM}$ as described in Section 2.5; (b) BH (Benjamini and Hochberg Citation1995); (c) AdaDetect (Marandon et al. Citation2024), where the density ratio is estimated by $\hat{r} (\cdot) = f_{0} (\cdot) / \hat{f} (\cdot)$ , with $f_{0}$ being known and $\hat{f} (\cdot)$ being the standard kernel estimator based on ${X, Y}$ . (d) ${PLIS}_{TG}$ , implemented via Algorithm 1 using the working model $M_{TG}$ as described in Section 2.5. (e) AdaPT (Lei and Fithian Citation2018), implemented using the R package adaptMT. Throughout all the simulations conducted, the FDR and Average Power (AP) levels are computed by averaging the results obtained from 200 replications.

4.1 Comparisons with Symmetric Rules

We first consider HMMs in which $Θ = {(θ_{i})}_{i = 1}^{m}$ is a binary Markov chain with transition matrix $A = {(a_{ij})}_{i, j = 0, 1} = {(P (θ_{t + 1} = j | θ_{t} = i))}_{i, j = 0, 1}$ . Here, we fix $a_{00} = 0.95$ , and the initial state of the latent chain is set to be $θ_{1} = 0$ .

We apply BH, AdaDetect, AdaPT, ${PLIS}_{HM}$ , and ${PLIS}_{TG}$ to the simulated data, and summarize the results in . Our results demonstrate that all of the methods under consideration effectively control the FDR at the nominal level. Both the ${PLIS}_{TG}$ and AdaDetect exhibit similar performance and outperform BH in terms of power. Notably, ${PLIS}_{HM}$ exhibits the highest power among all five methods in most scenarios. This can be attributed to the use of asymmetric rules in ${PLIS}_{HM}$ . AdaPT outperforms/underperforms BH when $a_{11}$ becomes large/small. Additional simulation results for HMMs are presented in Section E.2 of the supplement.

Figure 2: The comparison of FDR and AP for HMMs with $μ = 2.6$ and varying $a_{11}$ .

4.2 Comparisons beyond HMMs

This section examines the case where the true transition matrices change over time, which is commonly referred to as a heterogeneous HMM. Such models are particularly useful since the structure of real-world data is often nonstationary. Specifically, we define $A^{(k)} = {(a_{ij}^{(k)})}_{i, j = 0, 1}$ as the transition probability matrix for the kth transition, where $a_{ij}^{(k)} = P (θ_{k} = j | θ_{k - 1} = i)$ , and $a_{00}^{(k)} = 0.95$ is fixed for all k. The conditional distributions remain the same as in the preceding simulations. We consider a scenario in which $a_{11}^{(k)} = 0.9 e^{- k / 1000}$ decreases with k exponentially. This decreasing trend indicates that the stochastic system will gradually stabilize over time, with outliers becoming increasingly rare within the clusters. Since the data in this scenario possess a complex structure, accurately modeling them can be challenging. However, these data do exhibit patterns resembling HMMs, with moderate deviations from homogeneity conditions. Hence, we adopt a homogeneous HMM as a working model to capture the underlying structural patterns in the data and subsequently use the PLIS framework for inference. To assess the performance of PLIS and to compare it with other methods, specifically BH, AdaDetect, and AdaPT, we vary the parameter $μ$ from 2.2 to 3.2 and apply these methods to simulated data. The results are summarized in .

Figure 3: FDR and AP comparison for the heterogeneous HMM with exponentially vanishing $a_{11}$ .

Our results demonstrate that all methods are effective in controlling the FDR, with PLIS exhibiting higher average power compared to other methods. This highlights the effectiveness of PLIS in leveraging structural information, even when wrong models and algorithms are employed. Additionally, our analysis of more general data structures, such as two-layer dynamic models and structured models generated from a renewal process, demonstrates that PLIS remains more powerful than the other methods in most cases (see Figures 8–12 in Section E of the supplement). These findings underscore the versatility and robustness of PLIS in a variety of settings, suggesting its considerable potential for analyzing complex data structures in practice.

4.3 Comparisons Under a Covariate-Adaptive Model

Let ${S_{i} : i \in G}$ denote the covariates that encode side information. Consider a covariate-adaptive model that can be described using a hierarchical approach: $θ_{i} | (S_{i} = s) \overset{ind .}{\sim} Bernoulli (π_{s}), X_{i} | (S_{i} = s, θ_{i} = 0) \overset{iid}{\sim} F_{0} (x), X_{i} | (S_{i} = s, θ_{i} = 1) \overset{ind .}{\sim} F_{1 s} (x),$ where $P (θ_{i} = 1 | S_{i} = s) = π_{s}$ represents the local sparsity level. An equivalent model, motivated from an empirical Bayes perspective, is given by $X_{i} | (S_{i} = s) \overset{ind .}{\sim} F_{s} (x) = (1 - π_{s}) F_{0} (x) + π_{s} F_{1 s} (x) .$

Let $S_{i} = i$ for $i \in [3000]$ and fix $F_{0} = N (0, 1)$ . We consider the following scenarios: (i) $F_{1 s} = N (μ + 0.2 sin (0.6 s), 1)$ ; $π_{s} = 0.4 (1 + sin (0.2 s))$ for $s \in [201, 500] \cup [801, 1100] \cup [1501, 1800] \cup [2101, 2400]$ , and $π_{s} = 0.02$ otherwise; (ii) $F_{1 s} = N (2.8, 1)$ ; $π_{s} = 2 π$ for $s \in [201, 350] \cup [1501, 1650]$ , $π_{s} = π$ for $s \in [801, 1000] \cup [2101, 2300]$ , and $π_{s} = 0.02$ otherwise. Hence, the external covariate $S_{i}$ captures the local dependence structure, with $π_{s}$ and $F_{1 s}$ exhibiting similar values when $S_{i}$ values are comparable.

We apply PLIS (employing the HMM as its working model), LIS (Sun and Cai 2009; using the HMM as its underlying true model), BH and AdaPT (Lei and Fithian Citation2018; using the R package adaptMT with ${(S_{i})}_{i = 1}^{m}$ as its covariate sequence) at FDR level $α = 0.05$ . The results are presented in . The following patterns can be observed. First, AdaPT demonstrates increased efficiency as $S_{i}$ becomes more informative and may outperform both BH and PLIS in certain settings. Second, LIS, which mistakenly assumes that the data are generated from an HMM, suffers from severe FDR inflation. In contrast, PLIS, despite using the same EM algorithm as LIS, remains effective in FDR control even when the true model significantly diverges from the working model.

Figure 4: FDR and AP comparison under covariate-adaptive models.

5 Application

The identification of eye states has wide-ranging applications across various fields. By monitoring changes in eye states, such as eyelid closures and eye movements, systems can alert drivers when they exhibit signs of drowsiness, thereby preventing potential accidents. In mental health diagnosis, analyzing eye states can offer valuable insights into conditions such as schizophrenia or autism spectrum disorder. Eye state information can also effectively improve facial recognition performance and human–computer interactions.

This section illustrates semi-supervised PLIS for outlier detection using the Electroencephalogram (EEG) Eye State dataset, which is accessible at the UCI Machine Learning Repository (https://archive.ics.uci.edu). By conceptualizing the eye state sequence in a visual task as a Markovian process, the HMM provides a useful tool for modeling eye movement patterns, primarily due to its ability in capturing temporal dependencies within nearby latent states. In recent works such as Chuk et al. (Citation2020) and Hsiao et al. (Citation2022), it is shown that HMMs can effectively represent the dynamic characteristics of eye movement behaviors during a range of cognitive activities, offering valuable insights into the interplay between eye movement patterns and individual differences in cognitive styles or abilities. Furthermore, these studies indicate that the HMM is an appropriate model for describing the local dependence structure within eye states.

During the data preparation process, the eye states of one individual were continuously recorded in chronological order, with detections and manual labeling denoted as 0/1 to signify the eye-open/eye-closed states. The analysis uses data collected from an experiment focusing on measuring the Frontal left Hemisphere (F7). The primary goal is to pinpoint any eye-closed states (outliers) within a sequence of 1500 consecutive records (test samples). After collecting the test dataset, the remaining samples of this measurement with label “0” serve as the null samples. Concretely, we set aside the test samples and proceeded to generate the calibration samples (consisting of 1500 data points) and training samples by splitting the remaining data points with a label of “0”. Subsequently, we applied semi-supervised/conformal multiple testing methods, employing the null samples in both training and calibrating stages. The training data is also used to evaluate the null distribution and calculate p-values when implementing BH and AdaPT.

For the anomaly detection task, we implement various methods at the nominal FDR level of 0.05. The methods being considered include BH, AdaPT, AdaDetect, as well as the proposed semi-supervised PLIS (Algorithm 2) with two distinct working models: the HMM and Efron’s two-group model, respectively denoted as ${PLIS}_{HM}$ and ${PLIS}_{TG}$ . The counts of discovered outliers and corresponding FDPs, computed based on the true labels of test samples and presented in parentheses, are provided for each method below: ${PLIS}_{HM}$ : 69 (0.014), AdaPT: 38 (0), AdaDetect/ ${PLIS}_{TG}$ : 37 (0.054), and BH: 12 (0.083). Notably, the detected outliers are identical for AdaDetect and ${PLIS}_{TG}$ , corroborating the concluding remark in Example 2 of Section 2.5 regarding the similarity of AdaDetect and ${PLIS}_{TG}$ when employing Efron’s two-group model as the working model.

To further examine the comparison results, we visualize the outliers identified by ${PLIS}_{HM}$ , AdaPT, and AdaDetect in . The following patterns can be observed. First, AdaPT shows notable efficacy in detecting large outlier clusters compared to AdaDetect, which may be attributed to its successful integration of structural patterns, as encoded in the external covariate “Time”, in the decision-making process. Second, AdaPT exhibits reduced sensitivity in cases involving smaller clusters, leading to the omission of the final segment where structural information is relatively weak. Third, AdaDetect seems to demonstrate superior effectiveness in identifying isolated outliers. However, it turns out that the discovery at the single location of 711 is indeed a false positive, with the true label being “0”. This aligns with our intuition: if a person closes the eyes, the closed state will persist for a certain period of time. Conversely, by leveraging the structural information, ${PLIS}_{HM}$ and AdaPT successfully filter out this seemingly plausible signal. Finally, ${PLIS}_{HM}$ outperforms both AdaPT and AdaDetect by detecting the highest number of outliers while keeping the FDP well below the nominal level. This is made possible through the utilization of the HMM working model, which effectively exploits the informative structures within the data.

Figure 5: The discoveries of ${PLIS}_{HM}$ (red, top), AdaPT (green, middle) and AdaDetect (blue, bottom) in detecting eye-closed states in 1500 consecutive records according to F7 measurements.

Supplementary Materials

The supplement contains the proofs of theorems and propositions, some extensions of the proposed PLIS procedure, a comparative review of PLIS and related methods, and additional numerical results.

Supplemental material

8._Supplementary_Materials-PLIS__2_.zip

Download Zip (486.9 KB)

Acknowledgments

The authors thank the associate editor and two referees for their valuable comments, which have greatly enhanced the article. The comment from an insightful referee on the Selective SeqStep + algorithm and anti-symmetric functions has contributed significantly to enhance our understanding of related works.

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Notes

(p_{1}, \dots, p_{m})

have the positive regression dependency on each subset (PRDS) property on

H_{0}

P {(p_{j}, j \in [m]) \in D | p_{i} = u}

is nondecreasing in u for any

i \in H_{0}

and any nondecreasing set

D \subset R^{m}

2 In subsequent simulations, ${PLIS}_{HM}$ is referred to as PLIS if ${PLIS}_{TG}$ is not involved.

References

Barber, R. F., and Candès, R. J. (2015), “Controlling the False Discovery Rate via Knockoffs,” The Annals of Statistics, 43, 2055–2085. DOI: 10.1214/15-AOS1337.
Web of Science ®Google Scholar
Barber, R. F., Candes, E. J., Ramdas, A., and Tibshirani, R. J. (2023), “De finetti’s Theorem and Related Results for Infinite Weighted Exchangeable Sequences,” arXiv preprint arXiv:2304.03927.
Google Scholar
Bashari, M., Epstein, A., Romano, Y., and Sesia, M. (2023), “Derandomized Novelty Detection with FDR Control via Conformal e-values,” arXiv preprint arXiv:2302.07294.
Google Scholar
Bates, S., Candès, E., Lei, L., Romano, Y., and Sesia, M. (2023), “Testing for Outliers with Conformal p-values,” The Annals of Statistics, 51, 149–178. DOI: 10.1214/22-AOS2244.
Web of Science ®Google Scholar
Benjamini, Y., and Heller, R. (2007), “False Discovery Rates for Spatial Signals,” Journal of the American Statistical Association, 102, 1272–1281. DOI: 10.1198/016214507000000941.
Web of Science ®Google Scholar
Benjamini, Y., and Hochberg, Y. (1995), “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing,” Journal of the Royal Statistical Society, Series B, 57, 289–300. DOI: 10.1111/j.2517-6161.1995.tb02031.x.
Web of Science ®Google Scholar
Benjamini, Y., and Yekutieli, D. (2001), “The Control of the False Discovery Rate in Multiple Testing Under Dependency,” The Annals of Statistics, 29, 1165–1188. DOI: 10.1214/aos/1013699998.
Web of Science ®Google Scholar
Candès, E., Fan, Y., Janson, L., and Lv, J. (2018), “Panning for Gold: ‘Model-x’ Knockoffs for High Dimensional Controlled Variable Selection,” Journal of the Royal Statistical Society, Series B, 80, 551–577. DOI: 10.1111/rssb.12265.
Google Scholar
Chuk, T., Chan, A. B., Shimojo, S., and Hsiao, J. H. (2020), “Eye Movement Analysis with Switching Hidden Markov Models,” Behavior Research Methods, 52, 1026–1043. DOI: 10.3758/s13428-019-01298-y.
PubMed Web of Science ®Google Scholar
Copas, J. B. (1974), “On Symmetric Compound Decision Rules for Dichotomies,” Annals of Statistics, 2, 199–204.
Web of Science ®Google Scholar
Diaconis, P., and Freedman, D. (1980), “Finite Exchangeable Sequences,” The Annals of Probability, 8, 745–764. DOI: 10.1214/aop/1176994663.
Web of Science ®Google Scholar
Du, L., Guo, X., Sun, W., and Zou, C. (2023), “False Discovery Rate Control Under General Dependence by Symmetrized Data Aggregation,” Journal of the American Statistical Association, 118, 607–621. DOI: 10.1080/01621459.2021.1945459.
Web of Science ®Google Scholar
Durrett, R. (2019), Probability: Theory and Examples (Vol. 49), Cambridge: Cambridge University Press.
Google Scholar
Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001), “Empirical Bayes Analysis of a Microarray Experiment,” Journal of the American Statistical Association, 96, 1151–1160. DOI: 10.1198/016214501753382129.
Web of Science ®Google Scholar
Fan, J., Han, X., and Gu, W. (2012), “Estimating False Discovery Proportion Under Arbitrary Covariance Dependence,” Journal of the American Statistical Association, 107, 1019–1035. DOI: 10.1080/01621459.2012.720478.
PubMed Web of Science ®Google Scholar
Goodfellow, I., Bengio, Y., and Courville, A. (2016), Deep Learning, Cambridge, MA: MIT Press. http://www.deeplearningbook.org.
Google Scholar
Grünwald, P., de Heide, R., and Koolen, W. M. (2020), “Safe Testing,” in 2020 Information Theory and Applications Workshop (ITA), pp. 1–54. IEEE. DOI: 10.1109/ITA50056.2020.9244948.
Google Scholar
Heath, D., and Sudderth, W. (1976), “De Finetti’s Theorem on Exchangeable Variables,” The American Statistician, 30, 188–189. DOI: 10.1080/00031305.1976.10479175.
Web of Science ®Google Scholar
Hsiao, J. H., An, J., Hui, V. K. S., Zheng, Y., and Chan, A. B. (2022), “Understanding the Role of Eye Movement Consistency in Face Recognition and Autism through Integrating Deep Neural Networks and Hidden Markov Models,” npj Science of Learning, 7, 1–12. DOI: 10.1038/s41539-022-00139-6.
PubMedGoogle Scholar
Ignatiadis, N., and Huber, W. (2021), “Covariate Powered Cross-Weighted Multiple Testing,” Journal of the Royal Statistical Society, Series B, 83, 720–751. DOI: 10.1111/rssb.12411.
Google Scholar
Lafferty, J. D., McCallum, A., and Pereira, F. C. N. (2001), “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data,” in Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, San Francisco, CA, USA, pp. 282–289. Morgan Kaufmann Publishers Inc.
Google Scholar
Lei, L., and Fithian, W. (2018), “AdaPT: An Interactive Procedure for Multiple Testing with Side Information,” Journal of the Royal Statistical Society, Series B, 80, 649–679. DOI: 10.1111/rssb.12274.
Google Scholar
Liang, K., and Nettleton, D. (2010), “A Hidden Markov Model Approach to Testing Multiple Hypotheses on a Tree-Transformed Gene Ontology Graph,” Journal of the American Statistical Association, 105, 1444–1454. DOI: 10.1198/jasa.2010.tm10195.
Web of Science ®Google Scholar
Liang, Z., Sesia, M., and Sun, W. (2024), “Integrative Conformal p-values for Out-of-Distribution Testing with Labelled Outliers,” Journal of the Royal Statistical Society, Series B, 1–25. DOI: 10.1093/jrsssb/qkad138.
Google Scholar
Liu, J., Zhang, C., and Page, D. (2016), “Multiple Testing Under Dependence via Graphical Models,” The Annals of Applied Statistics, 10, 1699–1724. DOI: 10.1214/16-AOAS956.
Web of Science ®Google Scholar
Marandon, A., Lei, L., Mary, D., and Roquain, E. (2024), “Adaptive Novelty Detection with False Discovery Rate Guarantee,” The Annals of Statistics, 52, 157–183. DOI: 10.1214/23-AOS2338.
Web of Science ®Google Scholar
Mary, D., and Roquain, E. (2022), “Semi-Supervised Multiple Testing,” Electronic Journal of Statistics, 16, 4926–4981. DOI: 10.1214/22-EJS2050.
Web of Science ®Google Scholar
Onsager, L. (1944), “Crystal Statistics. I. A Two-Dimensional Model with an Order-Disorder Transition,” Physical Review, 65, 117–149. DOI: 10.1103/PhysRev.65.117.
Google Scholar
Perrot-Dockès, M., Blanchard, G., Neuvial, P., and Roquain, E. (2023), “Post Hoc False Discovery Proportion Inference Under a Hidden Markov Model,” TEST, 32, 1365–1391. DOI: 10.1007/s11749-023-00886-7.
Web of Science ®Google Scholar
Rabiner, L. (1989), “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE, 77, 257–286. DOI: 10.1109/5.18626.
Web of Science ®Google Scholar
Rebafka, T., Roquain, É., and Villers, F. (2022), “Powerful Multiple Testing of Paired Null Hypotheses Using a Latent Graph Model,” Electronic Journal of Statistics, 16, 2796–2858. DOI: 10.1214/22-EJS2012.
Web of Science ®Google Scholar
Ren, Z., and Barber, R. F. (2023), “Derandomized Knockoffs: Leveraging e-Values for False Discovery Rate Control,” Journal of the Royal Statistical Society Series B: Statistical Methodology, 86, 122–154. DOI: 10.1093/jrsssb/qkad085.
Web of Science ®Google Scholar
Ren, Z., and Candès, E. (2023), “Knockoffs with Side Information,” The Annals of Applied Statistics, 17, 1152–1174. DOI: 10.1214/22-AOAS1663.
Web of Science ®Google Scholar
Sarkar, S. K. (2002), “Some Results on False Discovery Rate in Stepwise Multiple Testing Procedures,” The Annals of Statistics, 30, 239–257. DOI: 10.1214/aos/1015362192.
Web of Science ®Google Scholar
Schervish, M. J. (2012), Theory of Statistics, New York: Springer.
Google Scholar
Sesia, M., Sabatti, C., and Candès, E. J. (2018), “Gene Hunting with Hidden Markov Model Knockoffs,” Biometrika, 106, 1–18. DOI: 10.1093/biomet/asy033.
PubMed Web of Science ®Google Scholar
Shafer, G. (2021), “Testing by Betting: A Strategy for Statistical and Scientific Communication,” Journal of the Royal Statistical Society, Series A, 184, 407–431. DOI: 10.1111/rssa.12647.
Web of Science ®Google Scholar
Shu, H., Nan, B., and Koeppe, R. (2015), “Multiple Testing for Neuroimaging via Hidden Markov Random Field,” Biometrics, 71, 741–750. DOI: 10.1111/biom.12329.
PubMed Web of Science ®Google Scholar
Storey, J. D. (2003), “The Positive False Discovery Rate: A Bayesian Interpretation and the q-value,” The Annals of Statistics, 31, 2013–2035. DOI: 10.1214/aos/1074290335.
Web of Science ®Google Scholar
Sun, W., and Cai, T. T. (2007), “Oracle and Adaptive Compound Decision Rules for False Discovery Rate Control,” Journal of the American Statistical Association, 102, 901–912. DOI: 10.1198/016214507000000545.
Web of Science ®Google Scholar
—-(2009), “Large-Scale Multiple Testing Under Dependence,” Journal of the Royal Statistical Society, Series B, 71, 393–424. DOI: 10.1111/j.1467-9868.2008.00694.x.
Web of Science ®Google Scholar
Sun, W., Reich, B. J., Cai, T. T., Guindani, M., and Schwartzman, A. (2015), “False Discovery Control in Large-Scale Spatial Multiple Testing,” Journal of the Royal Statistical Society, Series B, 77, 59–83. DOI: 10.1111/rssb.12064.
Google Scholar
Vovk, V., Gammerman, A., and Shafer, G. (2005), Algorithmic Learning in a Random World (Vol. 29), New York: Springer.
Google Scholar
Vovk, V., and Wang, R. (2021), “E-values: Calibration, Combination and Applications,” The Annals of Statistics, 49, 1736–1754. DOI: 10.1214/20-AOS2020.
Web of Science ®Google Scholar
Wang, R., and Ramdas, A. (2022), “False Discovery Rate Control with e-values,” Journal of the Royal Statistical Society, Series B, 84, 822–852. DOI: 10.1111/rssb.12489.
Google Scholar
Wu, W. B. (2008), “On False Discovery Control Under Dependence,” The Annals of Statistics, 36, 364–380. DOI: 10.1214/009053607000000730.
Web of Science ®Google Scholar
Yang, C.-Y., Lei, L., Ho, N., and Fithian, W. (2021), “Bonus: Multiple Multivariate Testing with a Data-Adaptivetest Statistic,” arXiv preprint arXiv:2106.15743.
Google Scholar

False Discovery Rate Control For Structured Multiple Testing: Asymmetric Rules And Conformal Q-values

Abstract

1 Introduction

1.1 Model-Based and Model-Free FDR Methods

1.2 Non-Exchangeable Scores and Asymmetric Decision Rules