Full article: Scaled Process Priors for Bayesian Nonparametric Estimation of the Unseen Genetic Variation

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

Abstract

There is a growing interest in the estimation of the number of unseen features, mostly driven by biological applications. A recent work brought out a peculiar property of the popular completely random measures (CRMs) as prior models in Bayesian nonparametric (BNP) inference for the unseen-features problem: for fixed prior’s parameters, they all lead to a Poisson posterior distribution for the number of unseen features, which depends on the sampling information only through the sample size. CRMs are thus not a flexible prior model for the unseen-features problem and, while the Poisson posterior distribution may be appealing for analytical tractability and ease of interpretability, its independence from the sampling information makes the BNP approach a questionable oversimplification, with posterior inferences being completely determined by the estimation of unknown prior’s parameters. In this article, we introduce the stable-Beta scaled process (SB-SP) prior, and we show that it allows to enrich the posterior distribution of the number of unseen features arising under CRM priors, while maintaining its analytical tractability and interpretability. That is, the SB-SP prior leads to a negative Binomial posterior distribution, which depends on the sampling information through the sample size and the number of distinct features, with corresponding estimates being simple, linear in the sampling information and computationally efficient. We apply our BNP approach to synthetic data and to real cancer genomic data, showing that: (i) it outperforms the most popular parametric and nonparametric competitors in terms of estimation accuracy; (ii) it provides improved coverage for the estimation with respect to a BNP approach under CRM priors. Supplementary materials for this article are available online.

Keywords:

1 Introduction

The problem of estimating the number of unseen features generalizes the popular unseen-species problem (Orlitsky, Suresh, and Wu Citation2016), and its importance has grown dramatically in recent years, driven by applications in biological sciences (Ionita-Laza, Lange, and Laird Citation2009; Gravel Citation2014; Zou et al. Citation2016; Chakraborty et al. Citation2019). Consider a generic population in which each individual is endowed with a finite collection of $W$ -valued features, with $W$ possibly being an infinite space, and denote by p_i the probability that an individual has feature $w_{i} \in W$ for $i \geq 1$ . The unseen-features problem assumes $N \geq 1$ observable random samples $Z_{1 : N} = (Z_{1}, \dots, Z_{N})$ from the population, such that $Z_{n} = {(A_{n, i})}_{i \geq 1}$ are independent Bernoulli random variables with unknown parameters ${(p_{i})}_{i \geq 1}$ . Then, the goal is to estimate the number of hitherto unseen features that would be observed if $M \geq 1$ additional samples were collected, that is, $U = \sum_{i \geq 1} 1 (\sum_{n = 1}^{N} A_{n, i} = 0) 1 (\sum_{m = 1}^{M} A_{N + m, i} > 0),$ with $1$ being the indicator function. The unseen-species problem arises under the assumption that each individual is endowed with only one feature, that is, a species. A wide range of approaches have been developed to estimate U, including Bayesian methods (Ionita-Laza, Lange, and Laird Citation2009; Masoero et al. Citation2021), jackknife (Gravel Citation2014), linear programming (Zou et al. Citation2016), and variations of Good-Toulmin estimators (Orlitsky, Suresh, and Wu Citation2016; Chakraborty et al. Citation2019).

In biological sciences, we may think of individuals as organisms and of features as groups to which organisms belong to, with each group being defined by any difference in the genome relative to a reference genome, that is, a (genetic) variant. In human biology, the estimation of U arises in the context of optimal allocation of resources between quantity and quality in genetic experiments: spending resources to sequence a greater number of genomes (quantity), which reveals more about variation across the population, or spending resources to sequence genomes with increased accuracy (quality), which reveals more about individual organisms’ genomes. Accurate estimates of U are critical in the experimental pipeline toward the goal of maximizing the usefulness of experiments under the tradeoff between quantity and quality (Ionita-Laza and Laird Citation2010; Zou et al. Citation2016). While in human-biology the cost of sequencing has decreased in recent years (Schwarze et al. Citation2020), the expense remains nontrivial, and it is still critical in fields where scientists work with relatively budgets, for example, nonhuman and nonmodel organisms (Souza et al. Citation2017). Other applications arise in precision medicine (Momozawa and Mizukami Citation2020), microbiome analysis (Sanders et al. Citation2019), single-cell sequencing (Zhang, Ntranos, and Tse Citation2020) and wildlife monitoring (Johansson et al. Citation2020).

1.1 Our Contributions

We introduce a Bayesian nonparametric (BNP) approach to the unseen-features problem, which relies on a novel prior distribution for the unknown ${(p_{i})}_{i \geq 1}$ . Completely random measures (CRMs) (Kingman Citation1992) provide a broad class of nonparametric priors for feature sampling problems, the most popular being the stable-Beta process prior (James Citation2017; Broderick, Wilson, and Jordan Citation2018). In a recent work, Masoero et al. (Citation2021) brought out a peculiar feature of CRM priors in the unseen-features problem: they all lead to a Poisson posterior distribution of U, given $Z_{1 : N}$ and fixed prior’s parameters, which depends on $Z_{1 : N}$ only through the sample size N. Despite the broadness of the class of CRM priors, such a common Poisson posterior structure makes CRMs not a flexible prior model for the unseen-features problem. While the Poisson posterior distribution may be appealing in principle, making posterior inferences analytically tractable and easy to interpret, its independence from $Z_{1 : N}$ makes the BNP approach a questionable oversimplification, with posterior inferences being completely determined by the estimation of the unknown prior’s parameters. A somehow similar scenario occurs in BNP inference for the unseen-species problem under a Dirichlet process (DP) prior (Ferguson Citation1973), and led to the use of the Pitman-Yor process (PYP) prior (Pitman and Yor Citation1997) for enriching the posterior distribution of the number of unseen species, while maintaining analytical tractability and interpretability of the DP prior (Lijoi, Mena, and Prünster Citation2007).

We show that scaled process (SP) priors, first introduced in James, Orbanz, and Teh (Citation2015), allow to enrich the posterior distribution of U arising under CRM priors. Under SP priors, we characterize the posterior distribution of U as a mixture of Poisson distributions that may include, through the mixing distribution, the whole sampling information in terms of the number of distinct features and their frequencies. While this is appealing in principle, it may be at stake with analytical tractability and interpretability, which are critical for a concrete use of SP priors. Then, we introduce the stable-Beta SP (SB-SP) prior, which provides a sensible tradeoff between the amount of sampling information introduced in the posterior distribution of U, and analytical tractability and interpretability of the posterior inferences. In particular, we characterize the SB-SP prior as the sole SP prior for which the posterior distribution of U, given $Z_{1 : N}$ and fixed prior’s parameters, depends on $Z_{1 : N}$ through the sample size N and the number K_N of distinct features; the SB-SP may thus be considered as the natural counterpart of the PYP for the unseen-feature problem. Under the SB-SP prior, the posterior distribution of U, as well as of a refinement of U that deals with the number of unseen rare features, is a negative Binomial posterior distributions, whose parameters depend on N, K_N and the prior’s parameters. Corresponding Bayesian estimates of U, with respect to a squared loss function, are simple, linear in K_N and computationally efficient.

We present an empirical validation of the effectiveness of our BNP methodology, both on synthetic and real data. As for real data, we consider cancer genomic data, where the goal is to estimate the number of new (genomic) variants to be discovered in future unobservable samples. In cancer genomics, accurate estimates of the number of new variants is of particular importance, as it might help practitioners understand the site of origin of cancers, as well as the clonal origin of metastasis, and in turn be a useful tool to develop effective clinical strategies (Chakraborty et al. Citation2019; Huyghe et al. Citation2019). We make use of data from the cancer genome atlas (TCGA), and focus on the challenging scenario in which the sample size N is particularly small, and also small with respect to the extrapolation size M. Such a scenario is of interest in genomic applications, where only few samples of rare cancer might be available. We show that our BNP methodology outperforms the most popular parametric and nonparametric competitors, both classical (frequentist) and Bayesian, in terms of estimation accuracy of U and a refinement of U for rare features. In addition, with respect to the BNP approach under the stable-Beta process prior (Masoero et al. Citation2021), our approach provides improved coverage for the estimation. This is an empirical evidence of the effectiveness of replacing the Poisson posterior distribution with the negative Binomial posterior distribution, which allows to better exploit the sampling information.

1.2 Organization of the Paper

In Section 2 we show how SP priors allow to enrich the posterior distribution of U arising under CRM priors. In Section 3 we introduce and investigate the SB-SP prior in the context of the unseen-features problem: (i) we characterize the SB-SP prior in the class of SP priors, providing its predictive distribution; (ii) we apply the SB-SP prior to the unseen-features problem, providing the posterior distribution of U and a BNP estimator. Section 4 contains illustrations of our method. In Section 5 we discuss our approach, a multivariate extension of it, and future research directions. Proofs and additional experiments are in the Appendix, supplementary materials.

2 Scaled Process Priors for Feature Sampling Problems

For a measurable space of features $W$ , we assume $N \geq 1$ observable individuals to be modeled as a random sample $Z_{1 : N}$ from the {0, 1}-valued stochastic process $Z (w) = \sum_{i \geq 1} A_{i} δ_{w_{i}} (w), w \in W$ , where ${(w_{i})}_{i \geq 1}$ are features in $W$ and ${(A_{i})}_{i \geq 1}$ are independent Bernoulli random variables with unknown parameters ${(p_{i})}_{i \geq 1}$ , p_i being the probability that an individual has feature w_i, for $i \geq 1$ . That is, Z is a Bernoulli process with parameter $ζ = \sum_{i \geq 1} p_{i} δ_{w_{i}}$ , denoted as $BeP (ζ)$ . BNP inference for feature sampling problems relies on the specification of a prior distribution on the discrete measure ζ, leading to the BNP-Bernoulli model, (1) $\begin{matrix} Z_{n} | ζ \overset{iid}{\sim} BeP (ζ) n = 1, \dots, N, \\ ζ \sim Z, \end{matrix}$ (1) namely ζ is a discrete random measure on $W$ whose law $Z$ takes on the interpretation of a prior distribution for the unknown feature’s composition of the population. By de Finetti’s theorem, the random variables Z_n’s in (1) are exchangeable with directing measure $Z$ (Aldous Citation1983). In this section, we show how SP priors for ζ (James, Orbanz, and Teh Citation2015) allow to enrich the posterior distribution of the number of unseen features arising under CRM priors.

2.1 CRM Priors for Bernoulli Processes

CRMs provide a standard tool to define nonparametric prior distributions on the parameter ζ of the Bernoulli process Z. Consider a homogeneous CRM μ₀ on $W$ , that is, $μ_{0} = \sum_{i \geq 1} ρ_{i} δ_{W_{i}}$ , where the ρ_i’s are (0, 1)-valued random atoms such that $\sum_{i \geq 1} ρ_{i} < + \infty$ , while the W_i’s are iid $W$ -valued random locations independent of the ρ_i’s. The law of μ₀ is characterized, through Lévy-Khintchine formula, by the Lévy intensity measure $ν_{0} (d s, d w) = λ_{0} (s) d s P (d w)$ on $(0, 1) \times W$ , where: (i) λ₀ is a measure on (0, 1), which controls the distribution of the ρ_i’s, and such that $\int_{(0, 1)} min {s, 1} λ_{0} (s) d s < + \infty$ ; ii) P is a nonatomic measure on $W$ , which controls the distribution of the W_i’s. For short, $μ_{0} \sim CRM (ν_{0})$ . See Appendix S1 for an account on CRMs (Kingman Citation1992, chap. 8). Note that, since P is nonatomic, the random atoms W_i’s are almost surely distinct, that is to say the different features cannot coincide almost surely. The law of μ₀ provides a natural prior distribution for the parameter ζ of the Bernoulli process. The Beta and the stable-Beta processes are popular examples of $μ_{0} \sim CRM (ν_{0})$ , for suitable specifications of ν₀. A comprehensive posterior analysis of CRM priors is presented in James (Citation2017). In the next proposition, we recall the predictive distribution of CRM priors (James Citation2017, Proposition 3.2).

Proposition 1.

Let $Z_{1 : N}$ be a random sample from (1) with $ζ \sim CRM (ν_{0})$ . If $Z_{1 : N}$ displays $K_{N} = k$ distinct features ${W_{1}^{*}, \dots, W_{K_{N}}^{*}}$ , each feature $W_{i}^{*}$ appearing exactly $M_{N, i} = m_{i}$ times, then the conditional distribution of $Z_{N + 1}$ , given $Z_{1 : N}$ , coincides with the distribution of (2) $Z_{N + 1} | Z_{1 : N} \overset{d}{=} Z_{N + 1}^{'} + \sum_{i = 1}^{K_{N}} A_{N + 1, i} δ_{W_{i}^{*}},$ (2) where: (i) $Z_{N + 1}^{'} | μ_{0}^{'} = \sum_{i \geq 1} A_{N + 1, i}^{'} δ_{W_{i}^{'}} \sim BeP (μ_{0}^{'})$ and $μ_{0}^{'} \sim CRM (ν_{0}^{'})$ , with $ν_{0}^{'} (d s, d w) = {(1 - s)}^{N} λ_{0} (s) d s P (d w)$ ; (ii) the $A_{N + 1, i}$ ’s are independent Bernoulli random variables with parameters J_i’s, such that J_i is distributed according to the density function $f_{J_{i}} (s) \propto {(1 - s)}^{N - m_{i}} s^{m_{i}} λ_{0} (s)$ for $i \geq 1$ .

According to (2), $Z_{N + 1}$ displays “new” features $W_{i}^{'}$ ’s, that is, features not appearing in the initial sample $Z_{1 : N}$ , and “old” features $W_{i}^{*}$ ’s, that is, features appeared in the initial sample $Z_{1 : N}$ . The posterior distribution of statistics of “new” features is determined by the law of $Z_{N + 1}^{'}$ , which depends on $Z_{1 : N}$ only through the sample size N; the posterior distribution of statistics of “old” features is determined by the law of $\sum_{1 \leq i \leq K_{N}} A_{N + 1, i} δ_{W_{i}^{*}}$ , which depends on $Z_{1 : N}$ through the sample size N, the number K_N of distinct features and their frequencies $(M_{N, 1}, \dots, M_{N, K_{N}})$ . As a corollary of Proposition 1, the posterior distribution of the number of “new” features in $(Z_{N + 1}, \dots, Z_{N + M})$ , given $Z_{1 : N}$ and fixed prior’s parameters, is a Poisson distribution that depends on $Z_{1 : N}$ only through N (Masoero et al. Citation2021). Such a posterior structure is peculiar to CRM priors, being inherited by the Poisson process formulation of CRMs (Kingman Citation1992). That is, despite the broadness of the class of CRM priors, all CRM priors lead to the same Poisson posterior structure for the number of unseen features, which thus makes them not a flexible prior model for the unseen-features problem. While the Poisson posterior distribution may be appealing in principle, making the posterior inferences analytically tractable and of easy interpretability, its independence from $Z_{1 : N}$ makes the BNP approach under CRM priors a questionable oversimplification, with posterior inferences being completely determined by the estimation of unknown prior’s parameters.

Remark 1.

For the sake of mathematical convenience, and in agreement with the work of James (Citation2017), in the sequel we maintain the random measure formulation for both the prior model μ₀ and the Bernoulli processes Z_n. However, we point out that each Z_n is equivalently characterized by means of the Bernoulli variables ${(A_{n, i})}_{i \geq 1}$ and the random features ${(W_{i})}_{i \geq 1}$ . In other terms, there exits a one-to-one correspondence between Z_n and the sequence of points ${(A_{n, i}, W_{i})}_{i \geq 1}$ . Finally, note that, although the values of features’ labels W_i are immaterial, the features W_i’s are assumed to be random. This is in line with the BNP literature on species sampling models, where the species’ labels are assumed to be random (Pitman Citation1996).

2.2 SP Priors for Bernoulli Processes

Consider a homogeneous CRM $μ = \sum_{i \geq 1} τ_{i} δ_{W_{i}}$ on $W$ , where the τ_i’s are nonnegative and such that $\sum_{i \geq 1} τ_{i} < + \infty$ , and the W_i’s are iid and independent of the τ_i’s. We denote by $ν (d s, d w) = λ (s) d s P (d w)$ on $R_{+} \times W$ , with $\int_{R_{+}} min {s, 1} λ (s) d s < + \infty$ , the Lévy intensity measure of μ. Let $Δ_{1} > Δ_{2} > \dots$ be the decreasingly ordered τ_i’s, and consider the discrete random measure $μ_{Δ_{1}} = \sum_{i \geq 1} \frac{Δ_{i + 1}}{Δ_{1}} δ_{W_{i + 1}},$ such that $Δ_{i + 1} / Δ_{1} \in (0, 1)$ , for $i \geq 1$ , and $\sum_{i \geq 1} Δ_{i + 1} / Δ_{1} < + \infty$ . A SP on $W$ is defined from $μ_{Δ_{1}}$ as follows. Let $F_{Δ_{1}} (d a) = exp {- \int_{a}^{\infty} λ (s) d s} λ (a) d a$ be the distribution of $Δ_{1}$ (Ferguson and Klass Citation1972, p. 1636), and let G_a be the conditional distribution of ${(Δ_{i + 1} / Δ_{1})}_{i \geq 1}$ given $Δ_{1} = a$ . Moreover, let $Δ_{1, h}$ denote a random variable whose distribution has a density function $f_{Δ_{1, h}} (a) = h (a) f_{Δ_{1}} (a)$ , where h is a nonnegative function and $f_{Δ_{1}}$ is the density function of $F_{Δ_{1}}$ . If ${(ρ_{i})}_{i \geq 1}$ are (0, 1)-valued random variables with distribution $G_{Δ_{1, h}}$ then (3) $μ_{Δ_{1, h}} = \sum_{i \geq 1} ρ_{i} δ_{W_{i + 1}} .$ (3) is a SP. For short, $μ_{Δ_{1, h}} \sim SP (ν, h)$ . The law of $μ_{Δ_{1, h}}$ is a prior distribution for the parameter ζ of the Bernoulli process. The next proposition characterizes the predictive distribution of SP priors. See also (James, Orbanz, and Teh Citation2015, Proposition 2.2) for a posterior analysis of SP priors.

Proposition 2.

Let $Z_{1 : N}$ be a random sample from (1) with $ζ \sim SP (ν, h)$ . If $Z_{1 : N}$ displays $K_{N} = k$ distinct features ${W_{1}^{*}, \dots, W_{K_{N}}^{*}}$ , each feature $W_{i}^{*}$ appearing exactly $M_{N, i} = m_{i}$ times, then the conditional distribution of $Δ_{1, h}$ , given $Z_{1 : N}$ , has a density function of the form (4) $g_{Δ_{1, h} | Z_{1 : N}} (a) \propto \frac{\prod_{i = 1}^{k} \int_{0}^{1} s^{m_{i}} {(1 - s)}^{N - m_{i}} a λ (a s) d s}{exp {\sum_{n = 1}^{N} \int_{0}^{1} s {(1 - s)}^{n - 1} a λ (a s) d s}} f_{Δ_{1, h}} (a) .$ (4)

Moreover, the conditional distribution of $Z_{N + 1}$ , given $(Δ_{1, h}, Z_{1 : N})$ , coincides with the distribution of (5) $Z_{N + 1} | (Δ_{1, h}, Z_{1 : N}) \overset{d}{=} Z_{N + 1}^{'} + \sum_{i = 1}^{K_{N}} A_{N + 1, i} δ_{W_{i}^{*}},$ (5) where as (i) $Z_{N + 1}^{'} | μ_{Δ_{1, h}}^{'} = \sum_{i \geq 1} A_{N + 1, i}^{'} δ_{W_{i}^{'}} \sim BeP (μ_{Δ_{1, h}}^{'})$ and $μ_{Δ_{1, h}}^{'} | Δ_{1, h} \sim CRM (ν_{Δ_{1, h}}^{'})$ , with $ν_{Δ_{1, h}}^{'} (d s, d w) = {(1 - s)}^{N} Δ_{1, h} λ (s Δ_{1, h}) 1_{(0, 1)} (s) d s P (d w)$ ; (ii) the $A_{N + 1, i}$ ’s are independent Bernoulli random variables with parameters J_i’s, respectively, such that $J_{i} | Δ_{1, h}$ is distributed according to the density function $f_{J_{i} | Δ_{1, h}} (s) \propto {(1 - s)}^{N - m_{i}} s^{m_{i}} Δ_{1, h} λ (Δ_{1, h} s) 1_{(0, 1)} (s) d s$ for $i \geq 1$ .

See Appendix S2 for the proof of Proposition 2. The marginalization of (5) with respect to (4) leads to the predictive distribution of SP priors: (i) $Z_{N + 1}$ displays “new” features $W_{i}^{'}$ ’s, and the posterior distribution of statistics of “new” features, given $Z_{1 : N}$ , is determined by the law of $(Δ_{1, h}, Z_{N + 1}^{'})$ ; (ii) $Z_{N + 1}$ displays “old” features $W_{i}^{*}$ ’s, and the posterior distribution of statistics of “old” features, given $Z_{N + 1}$ , is determined by the law of $(Δ_{1, h}, \sum_{1 \leq i \leq K_{N}} A_{N + 1, i} δ_{W_{i}^{*}})$ . Because of (4) and (5), the law of $(Δ_{1, h}, Z_{N + 1}^{'})$ may include the whole sampling information, depending on the specification of ν and h, and hence the posterior distribution of statistics of “new” features, given $Z_{1 : N}$ , also includes such an information. As a corollary of Proposition 2, the posterior distribution of the number of unseen features, given $Z_{1 : N}$ and fixed prior’s parameters, is a mixture of Poisson distributions that may include the whole sampling information; in particular, the amount of sampling information in the posterior distribution is uniquely determined by the mixing distribution, namely by the conditional distribution of $Δ_{1, h}$ , given $Z_{1 : N}$ . SP priors thus allow to enrich the Poisson posterior structure arising from CRM priors, in terms of both a more flexible distribution and the inclusion of more sampling information than the sole sample size N, though they may lead to unwieldy posterior inferences due to the marginalization with respect to (4).

The use of the sampling information in the predictive structure of SPs somehow resembles that of Poisson-Kingman (PK) models (Pitman Citation2006). PK models form a broad class of nonparametric priors for species sampling problems. The DP prior is a PK model whose predictive distribution is such that: (i) the conditional probability that the $(N + 1)$ th draw is a “new” species, given N observable samples, depends only on the sample size; (ii) the conditional probability that the $(N + 1)$ th draw is an “old” species, given N observable samples, depends on the sample size, the number of distinct species and their frequencies. Such a behavior resembles that of CRM priors, that is, Proposition 1. PK models allow to include more sampling information in the probability of discovering a “new species” arising under the DP prior, which typically determines a loss of the analytical tractability of posterior inferences for the number of unseen species (Bacallado et al. Citation2017). Such a behavior resembles that of SP priors, that is, Proposition 2. The PYP prior is arguably the most popular PK model. It stands out for enriching the probability of discovering a “new” species arising under the DP prior, by including the sampling information on the number of distinct species, while maintaining the analytical tractability and interpretability of the DP prior.

3 Stable-Beta Scaled Process (SB-SP) Priors for the Unseen-Features Problem

In Section 2 we showed how SP priors allow to enrich the Poisson posterior structure of the number of unseen features arising under CRM priors, for example the Beta and the stable-Beta process priors. While this is an appealing property, it may lead to a lack of analytical tractability and interpretability of posterior inferences, thus, making SP priors not of practical interest in applications. In this section, we introduce and investigate a peculiar SP prior, which is referred to as the SB-SP prior, and we show that: (i) it leads to a negative Binomial posterior distribution for the number of unseen features, which generalizes the Poisson distribution while maintaining its analytical tractability and interpretability; (ii) it leads to a posterior distribution for the number of unseen features, which depends on the sampling through the sample size and the number of distinct features. The SB-SP prior thus provides a sensible tradeoff between the enrichment of the Poisson posterior structure of the number of unseen features arising under CRM priors and the analytical tractability and interpretability of posterior inferences. In particular, we characterize the SB-SP prior as the sole SP prior for which the posterior distribution of the number of unseen features depends on the observable sample only through the sample size and the number of distinct features. The SB-SP may thus be considered as a natural counterpart of the PYP for the unseen-feature problem.

3.1 SB-SP Priors for Bernoulli Processes

Stable scaled processes (S-SP) (James, Orbanz, and Teh Citation2015) form a subclass of SPs, and hence their definition follows from Section 2. In particular, for any $σ \in (0, 1)$ , let $μ_{σ}$ be the σ-stable CRM on $W$ (Kingman Citation1975), which is characterized by the Lévy intensity measure $ν_{σ} (d s, d w) = λ_{σ} (s) d s P (d w)$ on $R_{+} \times W$ , with $\int_{R_{+}} min {s, 1} λ_{σ} (s) d s < + \infty$ , where $λ_{σ} (s) = σ s^{- 1 - σ}$ . We recall that the largest atom $Δ_{1}$ of $μ_{σ}$ is distributed according to the density function (6) $f_{Δ_{1}} (a) = σ a^{- 1 - σ} exp {- a^{- σ}} .$ (6)

That is, $Δ_{1} = E^{- 1 / σ}$ , where E denotes a negative exponential random variable with parameter 1. For any nonnegative function h, a S-SP on $W$ is defined as the SP with law $SP (ν_{σ}, h)$ . S-SP priors generalizes the Beta process prior, which is recovered by setting h to be the identity function, and then letting $σ \to 0$ (James, Orbanz, and Teh Citation2015). The predictive distribution of $ζ \sim SP (ν_{σ}, h)$ is obtained from Proposition 2. In the next theorem, we characterize the S-SP priors as the sole SP priors for which the conditional distribution of $Δ_{1, h}$ , given $Z_{1 : N}$ , depends on $Z_{1 : N}$ only through the sample size N and the number K_N of distinct features in $Z_{1 : N}$ .

Theorem 1.

Let $Z_{1 : N}$ be a random sample from (1) with $ζ \sim SP (ν, h)$ , and let $Z_{1 : N}$ displays K_N distinct features with corresponding frequencies $(M_{N, 1}, \dots, M_{N, K_{N}})$ . Moreover, let $ν (d s, d w) = λ (s) d s P (d w)$ , and let $f_{Δ_{1, h}}$ be the density function of $Δ_{1, h}$ . If $f_{Δ_{1, h}} > 0$ on $R_{+}$ and the functions λ and $f_{Δ_{1, h}}$ are continuously differentiable, then the conditional distribution of $Δ_{1, h}$ , given $Z_{1 : N}$ , depends on $Z_{1 : N}$ only through N and K_N if and only if $ν = ν_{σ}$ .

See Appendix S3 for the proof of Theorem 1. We recall from Section 2 that the conditional distribution of $Δ_{1, h}$ , given $Z_{1, N}$ , uniquely determines the amount of sampling information included in the posterior distribution of statistics of “new” features. Then, according to Theorem 1, S-SP priors are the sole SP priors for which the posterior distribution of the number of unseen features, given $Z_{1 : N}$ and fixed prior’s parameters, depends on $Z_{1 : N}$ only through N and K_N. As a corollary of Theorem 1, the Beta process prior is the sole S-SP prior for which the posterior distribution of statistics of “new” features depends on $Z_{1 : N}$ only through N. Analogous predictive characterizations are well-known in species sampling problems, and they are typically referred to as “sufficientness” postulates’ (Bacallado et al. Citation2017). In particular, the DP prior is characterized as the sole species sampling prior for which the conditional probability that the $(N + 1)$ th draw is a “new” species, given N observable samples, depends only on the sample size (Regazzini Citation1978). Moreover, the PYP prior is characterized as the sole species sampling prior for which the conditional probability that the $(N + 1)$ th draw is a “new” species, given N observable samples, depends only on the sample size and the number of distinct species in the sample (Zabell Citation2005). Theorem 1 provides a “sufficientness” postulates’ in the context of feature sampling problems.

As a noteworthy example of S-SPs, we introduce the SB-SP. The SB-SP is a S-SP obtained by a suitable specification of the nonnegative function h. In particular, for any $c, β > 0$ let (7) $h_{c, β} (a) = \frac{β^{c + 1}}{Γ (c + 1)} a^{- c σ} exp {- (β - 1) a^{- σ}},$ (7) where $Γ (\cdot)$ denotes the Gamma function. Then a SB-SP on $W$ is defined as the SP with law $SP (ν_{σ}, h_{c, β})$ . For short, we denote the law of a SB-SP by $SB - SP (σ, c, β)$ . The SB-SP prior generalizes the Beta process prior, which is recovered by setting c = 0 and β = 1, and then letting $σ \to 0$ . According to the construction of SPs, the distribution of $Δ_{1, h_{c, β}}$ has a density function obtained by combining (6) and (7); this is a polynomial-exponential tilting of the density function (6). In particular, $Δ_{1, h_{c, β}}^{- σ}$ is distributed as a Gamma distribution with shape $(c + 1)$ and rate β. Such a straightforward distribution for $Δ_{1, h_{c, β}}$ is at the core of the analytical tractability of posterior inferences under the SB-SP prior; this fact will be clear in the application of the SB-SP prior to the problem of estimating the number of unseen features. The next proposition characterizes the predictive distribution of the SB-SP prior.

Proposition 3.

Let $Z_{1 : N}$ be a random sample from (1) with $ζ \sim SB - SP (σ, c, β)$ . If $Z_{1 : N}$ displays $K_{N} = k$ distinct features ${W_{1}^{*}, \dots, W_{K_{N}}^{*}}$ , each feature $W_{i}^{*}$ appearing exactly $M_{N, i} = m_{i}$ times, then the conditional distribution of $Δ_{1, h_{c, β}}$ , given $Z_{1 : N}$ , has a density function of the form (8) $\begin{matrix} g_{Δ_{1, h_{c, β}} | Z_{1 : N}} (a) \\ = σ \frac{{(β + γ_{0}^{(N)})}^{k + c + 1}}{Γ (k + c + 1)} a^{- k σ - (c + 1) σ - 1} e^{- a^{- σ} (β + γ_{0}^{(N)})}, \end{matrix}$ (8) where $γ_{0}^{(N)} = σ \sum_{1 \leq i \leq N} B (1 - σ, i)$ , with $B (\cdot, \cdot)$ being the (Euler) Beta function. Moreover, the conditional distribution of $Z_{N + 1}$ , given $(Δ_{1, h_{c, β}}, Z_{1 : N})$ , coincides with the distribution of (9) $Z_{N + 1} | (Δ_{1, h_{c, β}}, Z_{1 : N}) \overset{d}{=} Z_{N + 1}^{'} + \sum_{i = 1}^{K_{N}} A_{N + 1, i} δ_{W_{i}^{*}},$ (9) where:

$Z_{N + 1}^{'} | μ_{Δ_{1, h_{c, β}}}^{'} = \sum_{i \geq 1} A_{N + 1, i}^{'} δ_{W_{i}^{'}} \sim BeP (μ_{Δ_{1, h_{c, β}}}^{'})$ such that $μ_{Δ_{1, h_{c, β}}}^{'}$ $| Δ_{1, h_{c, β}} \sim CRM (ν_{Δ_{1, h_{c, β}}}^{'})$ , with $\begin{matrix} ν_{Δ_{1, h_{c, β}}}^{'} (d s, d w) \\ = Δ_{1, Δ_{1, h_{c, β}}}^{- σ} {(1 - s)}^{N} σ s^{- 1 - σ} 1_{(0, 1)} (s) d s P (d w); \end{matrix}$
the $A_{N + 1, i}$ ’s are independent Bernoulli random variables with parameters J_i’s, respectively, such that each $J_{i} | Δ_{1, h_{c, β}}$ is distributed according to a density function of the form $\begin{matrix} f_{J_{i} | Δ_{1, h_{c, β}}} (s) \\ = \frac{1}{B (m_{i} - σ, N - m_{i} + 1)} s^{m_{i} - σ} {(1 - s)}^{N - m_{i} + 1} 1_{(0, 1)} (s) . \end{matrix}$

See Appendix S3 for the proof of Proposition 3. According to EquationEquation (8)(8) $\begin{matrix} g_{Δ_{1, h_{c, β}} | Z_{1 : N}} (a) \\ = σ \frac{{(β + γ_{0}^{(N)})}^{k + c + 1}}{Γ (k + c + 1)} a^{- k σ - (c + 1) σ - 1} e^{- a^{- σ} (β + γ_{0}^{(N)})}, \end{matrix}$ (8) , the conditional distribution of $Δ_{1, h_{c, β}}$ , given $Z_{1 : N}$ , depends on $Z_{1 : N}$ only through the sample size N and the number K_N of distinct features in $Z_{1 : N}$ . This agrees with Theorem 1, implying that the posterior distribution of the number of unseen features, given $Z_{1 : N}$ and fixed prior’s parameters, depends on $Z_{1 : N}$ only through N and K_N. Because of (8) and (9), the posterior distribution of statistics of “new” features stands out for analytical tractability, thus, being competitive with that arising from CRMs, for example, the Beta and the stable-Beta processes. In particular, from EquationEquation (9)(9) $Z_{N + 1} | (Δ_{1, h_{c, β}}, Z_{1 : N}) \overset{d}{=} Z_{N + 1}^{'} + \sum_{i = 1}^{K_{N}} A_{N + 1, i} δ_{W_{i}^{*}},$ (9) , the conditional distribution of $Z_{N + 1}^{'}$ , given $(Δ_{1, h_{c, β}}, Z_{1 : N})$ is a Poisson distribution that depends on $Z_{1 : N}$ only through N. Then, from (8), its marginalization with respect to the conditional distribution of $Δ_{1, h_{c, β}}$ , given $Z_{1 : N}$ , leads to a negative Binomial posterior distribution. Such an appealing property arises from the peculiar form $h_{c, β}$ that, combined with $ν_{σ}$ , leads to a conjugacy property for the conditional distribution of $Δ_{1, h_{c, β}}$ , given $Z_{1 : N}$ . That is, the conditional distribution of $Δ_{1, h_{c, β}}^{- σ}$ , given $Z_{1 : N}$ , is a Gamma distribution with shape $(K_{N} + c + 1)$ and rate $β + γ_{0}^{(N)}$ , which is the distribution $Δ_{1, h_{c, β}}^{- σ}$ with shape and rate being updated through $Z_{1 : N}$ . The next proposition establishes the distribution of a random sample $Z_{1 : N}$ from a SB-SP prior. See Appendix S3 for details.

Proposition 4.

Let $Z_{1 : N}$ be a random sample from (1) with $ζ \sim SB - SP (σ, c, β)$ . The probability that $Z_{1 : N}$ displays a particular feature allocation of k distinct features with frequencies $(m_{1}, \dots, m_{k})$ is (10) $\begin{matrix} p_{k}^{(N)} (m_{1}, \dots, m_{k}) = \frac{\frac{σ^{k} β^{c + 1}}{{(β + γ_{0}^{(N)})}^{k + c + 1}}}{\frac{Γ (c + 1)}{Γ (k + c + 1)}} \prod_{i = 1}^{k} \\ \times \frac{Γ (m_{i} - σ) Γ (N - m_{i} + 1)}{Γ (N - σ + 1)} . \end{matrix}$ (10)

3.2 BNP Inference for the Unseen-Features Problem

Now, we apply the SB-SP prior to the unseen-features problem. For any $N \geq 1$ let $Z_{1 : N}$ be an observable sample modeled as the BNP Bernoulli model (1), with $ζ \sim SB - SP (σ, c, β)$ . Moreover, under the same model of the Z_n’s, for any $M \geq 1$ let $(Z_{N + 1}, \dots, Z_{N + M})$ be additional unobservable sample. Then, the unseen-feature problem calls for the estimation of (11) $U_{N}^{(M)} = \sum_{i \geq 1} 1 (\sum_{m = 1}^{M} A_{N + m, i} > 0) 1 (\sum_{n = 1}^{N} A_{n, i} = 0),$ (11) namely the number of hitherto unseen features that would be observed in $(Z_{N + 1}, \dots, Z_{N + M})$ . As generalization of the unseen-feature problem (11), for $r \geq 1$ we consider the estimation of (12) $U_{N}^{(M, r)} = \sum_{i \geq 1} 1 (\sum_{m = 1}^{M} A_{N + m, i} = r) 1 (\sum_{n = 1}^{N} A_{n, i} = 0),$ (12) namely the number of hitherto unseen features that would be observed with prevalence r in $(Z_{N + 1}, \dots, Z_{N + M})$ . Of special interest is r = 1, which concerns rare (unique) features. The next theorem characterizes the posterior distributions of $U_{N}^{(M)}$ and $U_{N}^{(M, r)}$ , given $Z_{1 : N}$ . We denote by $NegativeBinonial (n, p)$ the negative Binomial distribution with parameter n and $p \in (0, 1)$ .

Theorem 2.

Let $Z_{1 : N}$ be a random sample from (1) with $ζ \sim SB - SP (σ, c, β)$ , and let $Z_{1 : N}$ displays $K_{N} = k$ distinct features with frequencies $(M_{N, 1}, \dots, M_{N, K_{N}}) = (m_{1}, \dots, m_{k})$ . Then, the posterior distributions of $U_{N}^{(M)}$ and of $U_{N}^{(M, r)}$ , given $Z_{1 : N}$ , coincide with the distributions of (13) $U_{N}^{(M)} | Z_{1 : N} \sim NegativeBinonial (K_{N} + c + 1, \frac{γ_{N}^{(M)}}{β + γ_{0}^{(N + M)}}),$ (13) and (14) $\begin{matrix} U_{N}^{(M, r)} | Z_{1 : N} \sim NegativeBinonial \\ \times (K_{N} + c + 1, \frac{ρ_{N}^{(M, r)}}{β + γ_{0}^{(N)} + ρ_{N}^{(M, r)}}), \end{matrix}$ (14) for any index of prevalence $r \geq 1$ , respectively, where $γ_{N}^{(M)} = σ \sum_{1 \leq i \leq M} B (1 - σ, N + i)$ and where $ρ_{N}^{(M, r)} = (\begin{matrix} M \\ r \end{matrix}) σ B (r - σ, N + M - r + 1)$ , with $B (\cdot, \cdot)$ denoting the (Euler) Beta function.

See Appendix S4 for the proof of Theorem 2. The posterior distributions (13) and (14) depend on $Z_{1 : N}$ through the sample size N and the number K_N of distinct features. This is in contrast with the corresponding posterior distributions obtained under the Beta and the stable-Beta process priors, which are Poisson distributions that depend on $Z_{1 : N}$ only through N (Masoero et al. Citation2021, Proposition 1). BNP estimators of $U_{N}^{(M)}$ and $U_{N}^{(M, r)}$ , with respect to a squared loss function, are obtained as the posterior expectations of (13) and (14), that is, (15) ${\hat{U}}_{N}^{(M)} = (K_{N} + c + 1) \frac{γ_{N}^{(M)}}{β + γ_{0}^{(N + M)} - γ_{N}^{(M)}}$ (15) and (16) ${\hat{U}}_{N}^{(M, r)} = (K_{N} + c + 1) \frac{ρ_{N}^{(M, r)}}{β + γ_{0}^{(N)}},$ (16) respectively. The estimators (15) and (16) are simple, linear in the sampling information and computationally efficient. In the next theorem we establish the large M asymptotic behavior of the posterior distributions (13) and (14), showing that the number of unseen features has a power-law growth in M. The same growth in M holds under the stable-Beta process prior (Masoero et al. Citation2021, Proposition 2), though the limiting distribution is degenerate.

Theorem 3.

Let $Z_{1 : N}$ be a random sample from (1) with $ζ \sim SB - SP (σ, c, β)$ , and let $Z_{1 : N}$ displays $K_{N} = k$ distinct features with frequencies $(M_{N, 1}, \dots, M_{N, K_{N}}) = (m_{1}, \dots, m_{k})$ . As $M \to + \infty$ (17) $\frac{U_{N}^{(M)}}{M^{σ}} ⏧ Z_{1 : N} \overset{a . s .}{\to} W_{N},$ (17) where W_N is a Gamma random variable with shape $(K_{N} + c + 1)$ and rate $(β + γ_{0}^{(N)}) / Γ (1 - σ)$ , and (18) $\frac{U_{N}^{(M, r)}}{M^{σ}} ⏧ Z_{1 : N} \overset{a . s .}{\to} W_{N, r},$ (18) where $W_{N, r}$ is a Gamma random variable with shape $(K_{N} + c + 1)$ and rate $Γ (r + 1) (β + γ_{0}^{(N)}) / σ Γ (r - σ)$ .

4 Experiments

Over the last decade, genomics has witnessed an extraordinary improvement in the data availability due to the advent of next generation sequencing technologies. Thanks to larger and richer datasets, researchers have started uncovering the role and impact of rare genetic variants in heritability and human disease (Hernandez et al. Citation2019; Momozawa and Mizukami Citation2020). The development of methods for estimating the number of new genomic variants to be observed in future studies is an active research area, as it can aid the design of effective clinical procedures in precision medicine (Ionita-Laza, Lange, and Laird Citation2009; Zou et al. Citation2016), enhance understanding of cancer biology (Chakraborty et al. Citation2019), and help to optimize sequencing procedures (Rashkin et al. Citation2017; Masoero et al. Citation2021). Here, we consider datasets of individual genomic sequences. Following common practice, we assume that an underlying fixed and idealized genomic sequence (the “reference”) is given. Then, each coordinate of an individual sequence reports the presence (1) or absence (0) of variation at a given locus with respect to the reference. All variants are treated equally, namely, any expression differing from the underlying reference at a given locus counts as a variant. We make use of our methodology to estimate the number of genomic loci at which variation was not observed in the original sample, and is going to be observed in (at least one of) M additional datapoints.

We find in our experiments that the estimates of the total number of new variants to be observed produced using the SB-SP-Bernoulli model, hereafter referred to as SSB, tend to be more accurate than other available methods in the literature. This phenomenon is particularly evident when the sample size N of the training set is small, and when the extrapolation size M is large with respect to N. Moreover, the SSB model is particularly effective in estimating the number of new rare variants, for example, variants appearing only once in the additional unobservable samples. Accurate estimation of rare variants is particularly important, as these are believed to be largely responsible for heritability of human disease (Rashkin et al. Citation2017; Chakraborty et al. Citation2019). To benchmark the quality of the SSB, we consider a number of competing methodologies for the feature prediction problem available in the literature: (i) Jackknife estimators (J) (Gravel Citation2014); (ii) a linear programming method (LP) (Zou et al. Citation2016) and variations of Good-Toulmin estimators (GT) (Chakraborty et al. Citation2019). We also compare our empirical findings to a BNP estimator obtained under the stable-Beta process prior (3BB), which has been introduced in Masoero et al. (Citation2021). We complete our analysis with a thorough investigation on synthetic data in S6 and S7, as well as on additional real data from the gnomAD database (Karczewski et al. Citation2020) in S8. https://github.com/lorenzomasoero/ScaledProcesses contains code and data to replicate all our analyses.

4.1 Empirics and Evaluation Metrics

For the SSB method to be useful, we need to estimate the underlying, unknown, parameters of the SB-SP prior. To learn these prior’s parameters, we here adopt an empirical Bayes procedure, which consists in maximizing the marginal distribution (10). In particular, we maximize numerically EquationEquation (10)(10) $\begin{matrix} p_{k}^{(N)} (m_{1}, \dots, m_{k}) = \frac{\frac{σ^{k} β^{c + 1}}{{(β + γ_{0}^{(N)})}^{k + c + 1}}}{\frac{Γ (c + 1)}{Γ (k + c + 1)}} \prod_{i = 1}^{k} \\ \times \frac{Γ (m_{i} - σ) Γ (N - m_{i} + 1)}{Γ (N - σ + 1)} . \end{matrix}$ (10) with respect to the parameters $β > 0$ , c > 0 and $σ \in (0, 1)$ of the SB-SP prior, and use the resulting values to produce our estimators. That is, we let $(\hat{β}, \hat{c}, \hat{σ}) = arg max_{(β, c, σ)} {p_{k}^{(N)} (m_{1}, \dots, m_{k})},$ and plug these values in the BNP estimator (13) and (14). The resulting values provide our BNP estimates of the number $U_{N}^{(M)}$ of new variants and the number $U_{N}^{(M, r)}$ of new variants with prevalence r.

To assess the accuracy of our estimates, we consider the percent deviation of the estimate from the truth to be the achieved accuracy. That is, the accuracy of the estimator ${\hat{U}}_{N}^{(M)}$ is defined as (19) $v_{N}^{(M)} : = 1 - min {\frac{| U_{N}^{(M)} - {\hat{U}}_{N}^{(M)} |}{U_{N}^{(M)}}, 1} .$ (19)

In particular, the accuracy $v_{N}^{(M)}$ equals 1 when the estimate is perfect (no error is incurred), and decreases to 0 as the estimate deviates from the truth. The $min$ operator in (19) ensures that $v_{N}^{(M)}$ lies in $[0, 1]$ : we let the accuracy to be equal to 0 whenever there is a severe overestimation, and the percentage estimation error exceeds 100%, that is, when ${\hat{U}}_{N}^{(M)} \geq 2 \times U_{N}^{(M)}$ . The SSB, 3BB and LP methods also offer an estimate for the number of new features observed with a given prevalence r. We let $v_{N}^{(M, r)}$ be the accuracy metric, where we replace in (19) $U_{N}^{(M)}$ with $U_{N}^{(M, r)}$ , the number of new features observed with prevalence r, and ${\hat{U}}_{N}^{(M)}$ with ${\hat{U}}_{N}^{(M, r)}$ .

4.2 Estimating the Number of New Variants in Cancer Genomics

Following the empirical study of Chakraborty et al. (Citation2019), we make use of data from the Cancer Genome Atlas (TCGA), the largest publicly available cancer genomics dataset, containing somatic mutations from $10, 295$ patients and spanning 33 different cancer types. We partition the samples into 33 smaller datasets according to cancer-type annotation of each patient. See Chakraborty et al. (Citation2019) and (Masoero et al. Citation2021, Appendix F) for details on the data and the experimental setup. For each cancer type, we retain a small fraction of the data for purposes of training, and consider the task of estimating the number of new variants that will be observed in a follow-up sample given a pilot sample. We validate our estimates by comparing the estimate ${\hat{U}}_{N}^{(M)}$ of the number of distinct variants to the true value, obtained by extrapolating to the remaining data. To assess the variability and error in our estimates, we repeat for every cancer type the experiment on S = 1000 subsets of the data, each obtained by randomly subsampling without replacement from the full sample.

We find that the SSB and 3BB methods perform particularly well when the training sample size N is small compared to the extrapolation sample size M. This setting is relevant in the context of cancer genomics, as scientists are interested in understanding the “unexploited potential” of the genetic information, especially for rare cancer subtypes (Chakraborty et al. Citation2019; Huyghe et al. Citation2019). To compare and quantify the performance of the available methodologies in this setting, we report in the distribution of the estimation accuracy when retaining only N = 10 samples for training and extrapolating to the largest possible sample size M for which we can compute the accuracy metric (EquationEquation (19)(19) $v_{N}^{(M)} : = 1 - min {\frac{| U_{N}^{(M)} - {\hat{U}}_{N}^{(M)} |}{U_{N}^{(M)}}, 1} .$ (19) ). We report results for the 10 cancer types with the largest number of samples in the original dataset. For each cancer type and for each method, the distribution of the estimation accuracy is obtained by considering its performance across the S = 1000 replicates. Across all cancer types, the estimates obtained from the SSB method achieve higher accuracy.

Fig. 1 Estimation accuracy $v_{N}^{(M)}$ for the number of new genomic variants ${\hat{U}}_{10}^{(M)}$ . For each method and each cancer type, we retain N = 10 random samples and use them to estimate up to M total observations, where N + M is the size of the original sample.

Fig. 1 Estimation accuracy vN(M) for the number of new genomic variants Û10(M). For each method and each cancer type, we retain N = 10 random samples and use them to estimate up to M total observations, where N + M is the size of the original sample.

We show in the behavior of ${\hat{U}}_{N}^{(i)}$ for five different cancer types as $i = 1, \dots, M$ . Again, we let N = 10, and M be the largest possible extrapolation value, as dictated by the dataset size. We report the estimates obtained from a fixed sample of size N = 10, as well as the variability around such estimates obtained by refitting each model, iteratively leaving one datapoint out from the sample. In this setting, the SSB method outperforms competing methods in terms of estimation accuracy. Moreover, the variability in the estimates arising from refitting the model on subsets of the data provides a useful measure of uncertainty in such estimation.

Fig. 2 Estimation of the number of new genomic variants ${\hat{U}}_{N}^{(i)}$ , for $i = 1, \dots, M$ . For each method and cancer type, we retain N = 10 random samples and use them to estimate up to the largest possible size. We fit each model on the full sample, as well as N = 10 additional times by iteratively leaving one datapoint out from the training sample. The solid black line is the true number of features that would have been observed (vertical axis) for any extrapolation size N + M (horizontal axis), for a fixed ordering of the data. Shaded regions report the prediction range obtained from the estimates from the leave-one-out fits.

Fig. 2 Estimation of the number of new genomic variants ÛN(i), for i=1,…,M. For each method and cancer type, we retain N = 10 random samples and use them to estimate up to the largest possible size. We fit each model on the full sample, as well as N = 10 additional times by iteratively leaving one datapoint out from the training sample. The solid black line is the true number of features that would have been observed (vertical axis) for any extrapolation size N + M (horizontal axis), for a fixed ordering of the data. Shaded regions report the prediction range obtained from the estimates from the leave-one-out fits.

4.3 Estimating the Number of New Rare Variants in Cancer Genomics

In recent years, the cancer genomics research community has become increasingly interested in studying and understanding the role of extremely rare variants, such as singletons, that is, observed in only one patient. Evidence suggests that rare deleterious variants can have far stronger effect sizes than common variants (Rasnic, Linial, and Linial Citation2020) and can play an important role in the development of cancer. For example, in breast cancer, it is well accepted that the risk of a variant is inversely proportional with respect to its prevalence: the rarer the variant, the higher the risk (Wendt and Margolin Citation2019). Therefore, effective identification and discovery of rare variants is an active, is an ongoing research area (Lawrenson et al. Citation2016; Lee et al. Citation2019). This phenomenon is not limited to breast cancer, but is progressively being studied across different cancer types. See, for example, the recent works on ovarian (Phelan et al. Citation2017), skin (Goldstein et al. Citation2017), prostate (Nguyen-Dumont et al. Citation2020) and lung (Liu et al. Citation2021) cancers and references therein. In downstream analysis, these estimates could be useful for planning and designing future experiments, for example, informing scientists on the number of new samples to be collected in order to observe a target number of new variants, or for power analysis considerations in rare variants association tests (Rashkin et al. Citation2017).

The BNP framework considered here allows us to estimate the number of new rare variants to be discovered (). While Zou et al. (Citation2016) did not consider the problem of estimating rare variants, it is straightforward to obtain an estimate for this quantity from their framework. Indeed, for every prevalence $x \in [0, 1]$ , the LP estimates the histogram h(x), which counts the number of variants appearing with prevalence x in the population, and the number of variants appearing with prevalence r follows from the binomial sampling model assumption, namely ${\hat{U}}_{N}^{(M, r)} = \sum_{x} h (x) {(\begin{matrix} N + M \\ r \end{matrix}) x^{r} {(1 - x)}^{N + M - r} - (\begin{matrix} N \\ r \end{matrix}) x^{r} {(1 - x)}^{N - r}}$ . We show in that the SSB method provides better estimates than the 3BB and LP methods.

Fig. 3 Estimation accuracy $v_{N}^{(M, 1)}$ for new variants appearing with prevalence one in future unobservable samples for different cancer types. For each method and each cancer, we retain N = 10 random samples and use them to estimate up to the largest possible size.

Fig. 4 Estimation accuracy $v_{N}^{(M, 1)}$ for new variants appearing with prevalence one in future samples. For each method and different cancer types, we retain a random sample of size $N = 5 %$ of the available dataset, and use it to estimate up to the largest possible size.

4.4 Coverage and Calibrated Uncertainties

One of the benefits of the BNP approach is that it automatically yields a notion of variability of the estimate of U via posterior credible intervals. We here check whether these intervals produce a useful notion of uncertainty, by investigating their calibration. For $α \in (0, 1)$ , we say that a $100 \times α %$ credible interval is calibrated if it contains the true value of interest, arising from hypothetical repeated draws, $100 \times α %$ of the times. We here assess the calibration of a $100 \times α %$ credible interval for $U_{N}^{(M)}$ conditionally given $Z_{1 : N}$ as follows. Let S be a large number (S = 1000 in our experiments). For each $s = 1, \dots, S$ , we retain a random subset of the data of size N, and estimate the corresponding parameters $\hat{β}, \hat{c}, \hat{σ}$ as discussed in Section 4.1. Then, we let ${\hat{W}}_{N, s, low}^{(M)} (α), {\hat{W}}_{N, s, hi}^{(M)} (α)$ be the endpoints of a $100 \times α %$ credible interval for the distribution of the number of new features, as given by EquationEquation (13)(13) $U_{N}^{(M)} | Z_{1 : N} \sim NegativeBinonial (K_{N} + c + 1, \frac{γ_{N}^{(M)}}{β + γ_{0}^{(N + M)}}),$ (13) , centered around the posterior predictive mean. We compute coverage calibration via $w_{N}^{(M)} (α) = \frac{1}{S} \sum_{s = 1}^{S} 1 {{\hat{W}}_{N, s, low}^{(M)} (α) \leq K_{N + M} \leq {\hat{W}}_{N, s, hi}^{(M)} (α)} .$

This is the fraction of the S experiments in which the true value was contained by a $100 \times α %$ credible interval. The closer $w_{N}^{(M)} (α)$ to α, the better calibrated the credible intervals. We compute the same quantity for the 3BB method using the results in Masoero et al. (Citation2021). Although still not perfect, we find that the posterior predictive intervals obtained from the SSB method are better calibrated than the ones under the 3BB method (see ).

Fig. 5 Coverage calibration of BNP estimators for number of new variants in future samples across all cancer types in TCGA. Different subplots refer to different ratios of the training N with respect to the extrapolation M. For each cancer, we retain a training sample of size $N \in {5 %, 10 %, 20 %, 30 %}$ of the total available dataset, and extrapolate up to the largest available M. Colored lines report the average coverage $w_{N}^{(M)} (α)$ across all cancer types (y-axis) as a function of α (x-axis). Faded dots refer to coverage for individual cancer types.

5 Discussion

Masoero et al. (Citation2021) first applied CRM priors to the unseen-features problem, showing that: (i) despite the broadness of the class of CRM priors, all CRM priors lead to the same Poisson posterior structure for the number of unseen features, which thus makes them not a flexible prior model for the unseen-features problem; (ii) while the Poisson posterior distribution may be appealing in principle, making the posterior inferences analytically tractable and of easy interpretability, its independence from $Z_{1 : N}$ makes the BNP approach a questionable oversimplification, with posterior inferences being completely determined by the estimation of unknown prior’s parameters. In this article, we introduced the SB-SP prior, and showed that: (i) it enriches the posterior distribution of the number of unseen features arising under CRM priors, which results in a negative Binomial distribution whose parameters depend on the sample size and the number of distinct features; (ii) it maintains the same analytical tractability and interpretability as CRM priors, which results in BNP estimators that are simple, linear in the sampling information and computationally efficient. The effectiveness of the SB-SP prior is showcased through an empirical analysis on synthetic and real data. Under the SB-SP prior, we found that estimates of the unseen number of features are accurate, and they outperform the most popular competitors in the challenging scenario where the sample size N is particularly small, and also small with respect to the extrapolation size M.

Our approach admits an extension to the multiple-feature setting, which takes into account of the many forms of variation, for example, single nucleotide changes, tandem repeats, insertions and deletions, copy number variations (Zou et al. Citation2016). We briefly describe the multiple-feature setting, and defer to Appendix S5 for details. It is assumed that a feature w_i comes with a characteristic, that is, the form of variation, chosen among q > 1 characteristics. For $N \geq 1$ , the observable sample $Z_{1 : N} = (Z_{1}, \dots, Z_{N})$ is modeled as a ${0, 1}^{q}$ -valued stochastic process $Z = \sum_{i \geq 1} A_{i} δ_{w_{i}}$ , where $A_{i} : = (A_{i, 1}, \dots, A_{i, q})$ is a Multinomial random variable with parameter $p_{i} = (p_{i, 1}, \dots, p_{i, q})$ such that $| p_{i} | = \sum_{1 \leq j \leq q} p_{i, j} < 1$ , and the $A_{i}$ ’s are iid. That is, for any $i \geq 1$ all the $A_{i, j}$ ’s are equal to 0 with probability $(1 - | p_{i} |)$ , that is, w_i does not display variation, or only one $A_{i, j}$ ’s is equal to 1 with probability $p_{i, j}$ , that is, w_i displays variation with characteristic j. Z is a multivariate Bernoulli process with parameter $ζ = \sum_{i \geq 1} p_{i} δ_{w_{i}}$ . The stable-Beta-Dirichlet process prior for $ζ$ is a multivariate generalization of the stable-Beta process prior (James Citation2017), and it leads to a Poisson posterior distribution for the number of unseen features, given $Z_{1 : N}$ , which depends on $Z_{1 : N}$ only through N. In Appendix S5 we introduce a scaled version of the stable-Beta-Dirichlet process, and show that it leads to a negative Binomial posterior distribution for the number of unseen features, which depends on $Z_{1 : N}$ through N and the number of distinct features in $Z_{1 : N}$ .

SP priors have been introduced in James, Orbanz, and Teh (Citation2015) and, to the best of our knowledge, since then no other works have further investigated such a class of priors. To date, the peculiar predictive properties of SP priors appear to be unknown in the BNP literature. Our work on the unseen-features problem is the first to highlight the great potential of SP priors in BNPs, showing that they provide a critical tool for enriching the predictive structure of the popular CRM priors (James Citation2017; Broderick, Wilson, and Jordan Citation2018). We believe that SPs may be of interest beyond the unseen-features problem, and more generally beyond the broad class of feature sampling problems. CRM priors, and in particular the Beta and stable-Beta process priors, have been widely used in several contexts, with a broad range of applications in topic modeling, analysis of social networks, binary matrix factorization for dyadic data, analysis of choice behavior arising from psychology and marketing surveys, graphical models, and analysis of similarity judgment matrices. See Griffiths and Ghahramani (Citation2011) and references therein for details. In all these contexts, SP priors may be more effective than CRM priors, as they allow to better exploit the sampling information in posterior inferences.

Among applications of SP priors beyond features sampling problems, it is worth mentioning the use of SP priors as hierarchical (or latent) priors in models of unsupervised learning (Griffiths and Ghahramani Citation2011, sec. 5), the most popular being Gaussian latent feature modeling. Differently from features sampling problems, where the values of features’ labels W_is are immaterial, in Gaussian latent feature modeling the values the W_i’s become material. That is, under the Gaussian latent feature model with a SP prior, observations are assumed to modeled as a multivariate Gaussian distribution, whose mean depends on latent features that are modeled with a SP prior, thus, making the values of features’ labels W_i’s of critical importance for the analysis. Bayesian factor analysis (Knowles and Ghahramani Citation2011) provides another context where SP priors may be usefully applied as hierarchical priors. Within the context of factor analysis, we also mention the work of Ayed and Caron (Citation2021) with applications to network analysis. There, the authors exploit CRM priors to recover the latent community structure in a network between individuals, and the features’ labels describe the level of affiliation of a certain individual to a latent community. In such a context, we believe that SP priors may be used in place of CRM priors, with the advantage of introducing richer predictive structure. In this respect, our work paves the way to promising directions of future research, in terms of both methods and applications.

Supplemental material

Supplemental Material

Download Zip (90.7 MB)

Supplemental Material

Download MS Word (44.8 KB)

Acknowledgments

The authors are grateful to the Editor, the Associate Editor, and two anonymous Referees for their comments and suggestions, which greatly improved the contents and presentation of the article. The authors thank Joshua Schraiber for useful discussions.

Supplementary Materials

All proofs of our statements, as well as additional experiments can be found in the Appendix. Code to replicate all our analyses is available at https://github.com/lorenzomasoero/ScaledProcesses/.

Correction Statement

This article has been corrected with minor changes. These changes do not impact the academic content of the article.

Additional information

Funding

Federico Camerlenghi and Stefano Favaro received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 817257. Federico Camerlenghi and Stefano Favaro gratefully acknowledge the financial support from the Italian Ministry of Education, “Dipartimenti di Eccellenza” grant 2018-2022. Lorenzo Masoero and Tamara Broderick were supported in part by the DARPA I2O LwLL program, an NSF CAREER Award, and ONR award N00014-17-1-2072.

References

Aldous, D. (1983), Exchangeability and Related Topics, volume 1117 of Lecture Notes in Mathematics, Berlin: Springer.
Google Scholar
Ayed, F., and Caron, F. (2021), “Nonnegative Bayesian Nonparametric Factor Models with Completely Random Measures,” Statistics and Computing, 31, Paper No. 63.
Web of Science ®Google Scholar
Bacallado, S., Battiston, M., Favaro, S., and Trippa, L. (2017), “Sufficientness Postulates for Gibbs-Type Priors and Hierarchical Generalizations,” Statistical Science, 32, 487–500. DOI: 10.1214/17-STS619.
Web of Science ®Google Scholar
Broderick, T., Wilson, A. C., and Jordan, M. I. (2018), “Posteriors, Conjugacy, and Exponential Families for Completely Random Measures,” Bernoulli, 24, 3181–3221. DOI: 10.3150/16-BEJ855.
Web of Science ®Google Scholar
Chakraborty, S., Arora, A., Begg, C. B., and Shen, R. (2019), “Using Somatic Variant Richness to Mine Signals from Rare Variants in the Cancer Genome,” Nature Communications, 10, 5506–5509. DOI: 10.1038/s41467-019-13402-z.
PubMed Web of Science ®Google Scholar
Ferguson, T. S. (1973), “A Bayesian Analysis of Some Nonparametric Problems,” The Annals of Statistics, 1, 209–230. DOI: 10.1214/aos/1176342360.
Web of Science ®Google Scholar
Ferguson, T. S., and Klass, M. J. (1972), “A Representation of Independent Increment Processes without Gaussian Components,” Annals of Mathematical Statistics, 43, 1634–1643. DOI: 10.1214/aoms/1177692395.
Google Scholar
Goldstein, A. M., Xiao, Y., Sampson, J., Zhu, B., Rotunno, M., Bennett, H., Wen, Y., Jones, K., Vogt, A., and Burdette, L. (2017), “Rare Germline Variants in Known Melanoma Susceptibility Genes in Familial Melanoma,” Human Molecular Genetics, 26, 4886–4895. DOI: 10.1093/hmg/ddx368.
PubMed Web of Science ®Google Scholar
Gravel, S. (2014), “Predicting Discovery Rates of Genomic Features,” Genetics, 197, 601–610. DOI: 10.1534/genetics.114.162149.
PubMed Web of Science ®Google Scholar
Griffiths, T. L., and Ghahramani, Z. (2011), “The Indian Buffet Process: An Introduction and Review,” Journal of Machine Learning Research, 12, 1185–1224.
Web of Science ®Google Scholar
Hernandez, R. D., Uricchio, L. H., Hartman, K., Ye, C., Dahl, A., and Zaitlen, N. (2019), “Ultrarare Variants Drive Substantial CIS Heritability of Human Gene Expression,” Nature Genetics, 51, 1349–1355. DOI: 10.1038/s41588-019-0487-7.
PubMed Web of Science ®Google Scholar
Huyghe, J. R., Bien, S. A., Harrison, T. A., Kang, H. M., Chen, S., Schmit, S. L., Conti, D. V., Qu, C., Jeon, J., and Edlund, C. K. (2019), “Discovery of Common and Rare Genetic Risk Variants for Colorectal Cancer,” Nature Genetics, 51, 76–87. DOI: 10.1038/s41588-018-0286-6.
PubMed Web of Science ®Google Scholar
Ionita-Laza, I., and Laird, N. M. (2010), “On the Optimal Design of Genetic Variant Discovery Studies,” Statistical Applications in Genetics and Molecular Biology, 9, 1–17. DOI: 10.2202/1544-6115.1581.
PubMed Web of Science ®Google Scholar
Ionita-Laza, I., Lange, C., and Laird, N. M. (2009), “Estimating the Number of Unseen Variants in the Human Genome,” Proceedings of the National Academy of Sciences, 106, 5008–5013. DOI: 10.1073/pnas.0807815106.
PubMed Web of Science ®Google Scholar
James, L. F. (2017), “Bayesian Poisson Calculus for Latent Feature Modeling via Generalized Indian Buffet Process Priors,” The Annals of Statistics, 45, 2016–2045. DOI: 10.1214/16-AOS1517.
Web of Science ®Google Scholar
James, L. F., Orbanz, P., and Teh, Y. W. (2015), “Scaled Subordinators and Generalizations of the Indian Buffet Process,” arXiv preprint arXiv:1510.07309.
Google Scholar
Johansson, Ö, Samelius, G., Wikberg, E., Chapron, G., Mishra, C., and Low, M. (2020), “Identification Errors in Camera-Trap Studies Result in Systematic Population Overestimation,” Scientific Reports, 10, 1–10. DOI: 10.1038/s41598-020-63367-z.
PubMed Web of Science ®Google Scholar
Karczewski, K. J., Francioli, L. C., Tiao, G., Cummings, B. B., Alföldi, J., Wang, Q., Collins, R. L., Laricchia, K. M., Ganna, A., and Birnbaum, D. P. (2020), “The Mutational Constraint Spectrum Quantified from Variation in 141,456 Humans,” Nature, 581, 434–443. DOI: 10.1038/s41586-020-2308-7.
PubMed Web of Science ®Google Scholar
Kingman, J. (1992), Poisson Processes. Oxford Studies in Probability, Oxford: Clarendon Press.
Google Scholar
Kingman, J. F. (1975), “Random Discrete Distributions,” Journal of the Royal Statistical Society, Series B, 37, 1–15. DOI: 10.1111/j.2517-6161.1975.tb01024.x.
Google Scholar
Knowles, D., and Ghahramani, Z. (2011), “Nonparametric Bayesian Sparse Factor Models with Application to Gene Expression Modeling,” The Annals of Applied Statistics, 5, 1534–1552. DOI: 10.1214/10-AOAS435.
Web of Science ®Google Scholar
Lawrenson, K., Kar, S., McCue, K., Kuchenbaeker, K., Michailidou, K., Tyrer, J., Beesley, J., Ramus, S. J., Li, Q., and Delgado, M. K. (2016), “Functional Mechanisms Underlying Pleiotropic Risk Alleles at the 19p13. 1 Breast–Ovarian Cancer Susceptibility Locus,” Nature Communications, 7, 1–22. DOI: 10.1038/ncomms12675.
PubMedGoogle Scholar
Lee, A., Mavaddat, N., Wilcox, A. N., Cunningham, A. P., Carver, T., Hartley, S., de Villiers, C. B., Izquierdo, A., Simard, J., and Schmidt, M. K. (2019), “Boadicea: A Comprehensive Breast Cancer Risk Prediction Model Incorporating Genetic and Nongenetic Risk Factors,” Genetics in Medicine, 21, 1708–1718. DOI: 10.1038/s41436-018-0406-9.
PubMed Web of Science ®Google Scholar
Lijoi, A., Mena, R. H., and Prünster, I. (2007), “Bayesian Nonparametric Estimation of the Probability of Discovering New Species,” Biometrika, 94, 769–786. DOI: 10.1093/biomet/asm061.
Web of Science ®Google Scholar
Liu, Y., Xia, J., McKay, J., Tsavachidis, S., Xiao, X., Spitz, M. R., Cheng, C., Byun, J., Hong, W., and Li, Y. (2021), “Rare Deleterious Germline Variants and Risk of Lung Cancer,” NPJ Precision Oncology, 5, 1–12.
PubMed Web of Science ®Google Scholar
Masoero, L., Camerlenghi, F., Favaro, S., and Broderick, T. (2021), “More for Less: Predicting and Maximizing Genomic Variant Discovery via Bayesian Nonparametrics,” Biometrika, 109, 17–32. DOI: 10.1093/biomet/asab012.
Web of Science ®Google Scholar
Momozawa, Y., and Mizukami, K. (2020), “Unique Roles of Rare Variants in the Genetics of Complex Diseases in Humans,” Journal of Human Genetics, 66, 11–23. DOI: 10.1038/s10038-020-00845-2.
PubMed Web of Science ®Google Scholar
Nguyen-Dumont, T., MacInnis, R. J., Steen, J. A., Theys, D., Tsimiklis, H., Hammet, F., Mahmoodi, M., Pope, B. J., Park, D. J., and Mahmood, K. (2020), “Rare Germline Genetic Variants and Risk of Aggressive Prostate Cancer,” International Journal of Cancer, 147, 2142–2149. DOI: 10.1002/ijc.33024.
PubMed Web of Science ®Google Scholar
Orlitsky, A., Suresh, A. T., and Wu, Y. (2016), “Optimal Prediction of the Number of Unseen Species,” Proceedings of the National Academy of Sciences, 113, 13283–13288. DOI: 10.1073/pnas.1607774113.
PubMed Web of Science ®Google Scholar
Phelan, C. M., Kuchenbaecker, K. B., Tyrer, J. P., Kar, S. P., Lawrenson, K., Winham, S. J., Dennis, J., Pirie, A., Riggan, M. J., and Chornokur, G. (2017), “Identification of 12 New Susceptibility Loci for Different Histotypes of Epithelial Ovarian Cancer,” Nature Genetics, 49, 680–691. DOI: 10.1038/ng.3826.
PubMed Web of Science ®Google Scholar
Pitman, J. (1996), “Some Developments of the Blackwell-MacQueen urn Scheme,” in Statistics, Probability and Game Theory, volume 30 of IMS Lecture Notes Monograph Series, eds. T. S. Ferguson, L. S. Shapley, and J. B. MacQueen, pp. 245–267, Hayward, CA: Institute of Mathematical Statistics.
Google Scholar
Pitman, J. (2006), Combinatorial Stochastic Processes, volume 1875 of Lecture Notes in Mathematics, Berlin: Springer.
Google Scholar
Pitman, J., and Yor, M. (1997), “The Two-Parameter Poisson-Dirichlet Distribution Derived from a Stable Subordinator,” Annals of Probability, 25, 855–900.
Web of Science ®Google Scholar
Rashkin, S., Jun, G., Chen, S., Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO), and Abecasis, G. R. (2017), “Genetics, and E. of Colorectal Cancer Consortium. Optimal Sequencing Strategies for Identifying Disease-Associated Singletons,” PLoS Genetics, 13, e1006811.
PubMedGoogle Scholar
Rasnic, R., Linial, N., and Linial, M. (2020), “Expanding Cancer Predisposition Genes with Ultra-Rare Cancer-Exclusive Human Variations,” Scientific Reports, 10, 1–9. DOI: 10.1038/s41598-020-70494-0.
PubMed Web of Science ®Google Scholar
Regazzini, E. (1978), “Intorno ad alcune questioni relative alla definizione del premio secondo la teoria della credibilià,” Giornale dell’Istituto Italiano degli Attuari, 41, 77–89.
Google Scholar
Sanders, J. G., Nurk, S., Salido, R. A., Minich, J., Xu, Z. Z., Zhu, Q., Martino, C., Fedarko, M., Arthur, T. D., and Chen, F. (2019), “Optimizing Sequencing Protocols for Leaderboard Metagenomics by Combining Long and Short Reads,” Genome Biology, 20, 1–14. DOI: 10.1186/s13059-019-1834-9.
PubMed Web of Science ®Google Scholar
Schwarze, K., Buchanan, J., Fermont, J. M., Dreau, H., Tilley, M. W., Taylor, J. M., Antoniou, P., Knight, S. J., Camps, C., and Pentony, M. M. (2020), “The Complete Costs of Genome Sequencing: A Microcosting Study in Cancer and Rare Diseases from a Single Center in the United Kingdom,” Genetics in Medicine, 22, 85–94. DOI: 10.1038/s41436-019-0618-7.
PubMed Web of Science ®Google Scholar
Souza, C. A., Murphy, N., Villacorta-Rath, C., Woodings, L. N., Ilyushkina, I., Hernandez, C. E., Green, B. S., Bell, J. J., and Strugnell, J. M. (2017), “Efficiency of ddrad Target Enriched Sequencing across Spiny Rock Lobster Species (palinuridae: Jasus),” Scientific Reports, 7, 1–14. DOI: 10.1038/s41598-017-06582-5.
Google Scholar
Wendt, C., and Margolin, S. (2019), “Identifying Breast Cancer Susceptibility Genes–A Review of the Genetic Background in Familial Breast Cancer,” Acta Oncologica, 58, 135–146.
PubMed Web of Science ®Google Scholar
Zabell, S. (2005), The Continuum of Inductive Methods Revisited, Cambridge: Cambridge University Press.
Google Scholar
Zhang, M. J., Ntranos, V., and Tse, D. (2020), “Determining Sequencing Depth in a Single-Cell RNA-Seq Experiment,” Nature Communications, 11, 1–11. DOI: 10.1038/s41467-020-14482-y.
PubMed Web of Science ®Google Scholar
Zou, J., Valiant, G., Valiant, P., Karczewski, K., Chan, S. O., Samocha, K., Lek, M., Sunyaev, S., Daly, M., and MacArthur, D. G. (2016), “Quantifying Unobserved Protein-Coding Variants in Human Populations Provides a Roadmap for Large-Scale Sequencing Projects,” Nature Communications, 7, 13293. DOI: 10.1038/ncomms13293.
PubMed Web of Science ®Google Scholar

Scaled Process Priors for Bayesian Nonparametric Estimation of the Unseen Genetic Variation

Abstract