
Posterior contraction rate of sparse latent feature models with application to proteomics

Pages 29-39 | Received 13 Mar 2020, Accepted 08 Aug 2021, Published online: 05 Sep 2021

Abstract

The Indian buffet process (IBP) and the phylogenetic Indian buffet process (pIBP) can be used as prior models to infer latent features in a data set. The theoretical properties of these models are under-explored, however, especially in high dimensional settings. In this paper, we show that under a mild sparsity condition, the posterior distribution of the latent feature matrix, generated via IBP or pIBP priors, converges to the true latent feature matrix asymptotically. We derive the posterior convergence rate, referred to as the contraction rate. We show that the convergence results remain valid even when the dimensionality of the latent feature matrix increases with the sample size, thereby making the posterior inference valid in high dimensional settings. We demonstrate the theoretical results using computer simulations, in which the parallel-tempering Markov chain Monte Carlo method is applied to overcome computational hurdles. The practical utility of the derived properties is demonstrated by inferring the latent features in a reverse phase protein arrays (RPPA) dataset under the IBP prior model.

1. Introduction

Latent feature models are concerned with finding latent structures in a data set $X_{n\times p}$, where each row $x_i=(x_{i1},\ldots,x_{ip})$ represents a single observation of $p$ objects and $n$ is the sample size. We consider the case where the number of objects $p=p_n$ increases as the sample size $n$ increases. The goal is to explain the variability of the observed data with a latent binary feature matrix $Z_{p\times K}$, where each column of $Z$ represents a latent feature that includes a subset of the $p$ objects. The number of latent features $K$ is unknown and is inferred as well.

Bayesian nonparametric latent feature models such as the Indian buffet process (IBP) (Griffiths & Ghahramani, 2006, 2011) can be used to define the prior distribution of a binary latent feature matrix with arbitrarily many columns. In many applications (such as Chu et al., 2006), these priors can lead to desirable posterior inference. An important property of the IBP is that the corresponding distribution maintains exchangeability across the rows that index the experimental units, making posterior inference relatively simple and easy to implement. However, sometimes the rows of the latent feature matrix must follow a group structure, such as in phylogenetic inference. To address such needs, the phylogenetic Indian buffet process (pIBP) (Miller et al., 2008) has been developed to allow different rows to be partially exchangeable.

Despite the increasing popularity of IBP and pIBP prior models in applications such as cancer and evolutionary genomics, few theoretical results have been reported on posterior inference based on these models. For example, from a frequentist view, it is important to investigate the asymptotic convergence of the posterior distribution of the latent feature matrix under IBP and pIBP priors. Existing literature on the theory of Bayesian posterior consistency includes, for example, Schwartz (1965), Barron et al. (1999) and Ghosal et al. (2000). Chen et al. (2016) is a motivational work exploring theoretical properties of the posterior distribution of the latent feature matrix based on IBP or pIBP priors. They explored the asymptotic behaviour of IBP- or pIBP-based posterior inference in a regime where the sample size $n$ increases at a much faster rate than the number of objects $p_n$, i.e., the dimensionality of $Z$. This regime might be hard to achieve in some real applications. We consider important extensions of Chen et al. (2016). In particular, we consider properties of posterior inference based on IBP and pIBP priors in high dimensions and with sparsity. Under a similar high dimensional and sparse setting, related work is Pati et al. (2014), where the authors studied the asymptotic behaviour of the sparse Bayesian factor models discussed in West (2003). These models concern continuous latent features, which differ from binary feature models like the IBP and pIBP.

High dimensional inference is now routinely needed in many applications, such as genomics and proteomics. Due to the reduced cost of high-throughput biological experiments (e.g., next-generation sequencing), a large number of genomic elements (such as genes) can be measured within a relatively short amount of time and at low cost for a large number of patients. In our application, the number of genomic elements $n$ is the sample size, the number of patients $p$ is the number of rows, or the number of objects, in the latent feature matrix, and the number of latent features $K$ is assumed unknown. Depending on the particular research question, in some applications, genes can be the objects and patients can be the samples. When $p=p_n$ becomes large relative to $n$, the sparsity of the feature matrix critically ensures the efficiency and validity of statistical inference. We will show that under the sparsity condition, the requirement for posterior convergence can be relaxed from $p_n^3=o(n)$ (Chen et al., 2016) to $p_n(\log p_n)^2=o(n)$ (see Remark 3.2).

Our proposed sparsity condition can be expressed in terms of the number of features possessed by each object and can therefore be reasonably interpreted in practice. For example, in genomics and proteomics applications, our sparsity condition means that the number of features shared by different patients is small, i.e., the patients are heterogeneous. This differs from some published sparsity conditions that involve more complicated mathematical expressions, possibly in terms of the properties of complex matrices, which are difficult to check in real-world applications.

The rest of the paper is organized as follows. Section 2 introduces the latent feature model and the IBP/pIBP priors. Section 3 establishes the posterior contraction rate of sparse latent feature models under the IBP and pIBP priors, which is the main theoretical result of this paper. Section 4 proposes an efficient posterior inference scheme based on Markov chain Monte Carlo (MCMC) simulations. Section 5 provides both simulated and real-world proteomics examples that support the theoretical derivations. We conclude the paper with a brief discussion in Section 6. Some technical details are provided in the supplement. The code for replicating the results reported in the manuscript can be accessed at https://github.com/tianjianzhou/IBP.

2. Notation and probability framework

In this section, we first introduce some notation, and then specify the hierarchical model including the sampling model and the prior model. In particular, the sampling model is the latent feature model, and the prior model is the IBP mixture or pIBP mixture.

2.1. Notation

Throughout the paper, we denote by $p(\cdot)$ and $P(\cdot)$ probability density functions (pdf) and probability mass functions (pmf), respectively. Specifically, for the latent feature matrix $Z$, we use $\Pi(Z)$ and $\Pi(Z \mid X)$ to denote the prior and posterior distributions of $Z$, respectively. The likelihood is $p(X \mid Z)=\prod_{i=1}^n p(x_i^T \mid Z)$. For two sequences $a_n$ and $b_n$, the notation $a_n=O(b_n)$ means there exist a positive real number $C$ and a constant $n_0$ such that $a_n \le C b_n$ for all $n \ge n_0$; the notation $a_n=o(b_n)$ means that for every positive real number $\varepsilon$ there exists a constant $n_0$ such that $a_n \le \varepsilon b_n$ for all $n \ge n_0$. For a matrix $A$, $\|A\|$ denotes the spectral norm, defined as the largest singular value of $A$. Finally, $C$ is a generic notation for positive constants whose value might change depending on the context but is independent of other quantities.

2.2. Latent feature model

Suppose that $X_{n\times p}$ is a collection of the observed data. Each row $x_i=(x_{i1},\ldots,x_{ip})$ represents a single observation of $p$ objects, for $i=1,\ldots,n$, where the $x_i$'s are independent. Assume that the mechanism generating $X$ can be characterized by latent features,
$$X^T = ZA + E. \qquad (1)$$
Here $Z=(z_{jk})_{p\times K}$ denotes the latent binary feature matrix, where each entry $z_{jk}\in\{0,1\}$ indicates whether object $j$ possesses feature $k$ ($z_{jk}=1$) or not ($z_{jk}=0$). The loading matrix is $A=(a_{ki})_{K\times n}$, with each entry representing the contribution of the $k$th feature to the $i$th observation. We assume $a_{ki} \stackrel{\text{i.i.d.}}{\sim} N(0,\sigma_a^2)$. The error matrix is $E=(e_{ji})_{p\times n}$, where the $e_{ji}$'s are independent Gaussian errors, $e_{ji} \stackrel{\text{i.i.d.}}{\sim} N(0,\sigma^2)$.

After integrating out $A$, we obtain, for each observation of the $p$ objects, $x_i^T \stackrel{\text{i.i.d.}}{\sim} N_p(0,\sigma_a^2 ZZ^T+\sigma^2 I_p)$, where $N_p$ denotes a $p$-variate Gaussian distribution. Therefore, the conditional distribution of $X$ given $Z$ is
$$p(X \mid Z)=\prod_{i=1}^n \frac{1}{\sqrt{(2\pi)^p \det(\sigma_a^2 ZZ^T+\sigma^2 I_p)}} \exp\left(-\frac{1}{2}\, x_i \left(\sigma_a^2 ZZ^T+\sigma^2 I_p\right)^{-1} x_i^T\right). \qquad (2)$$
Without loss of generality, following Chen et al. (2016), we always assume that $\sigma^2=\sigma_a^2=1$ for deriving the theoretical results. One of the primary interests is to estimate $ZZ^T$, which is usually called the similarity matrix, since each of its entries is the number of features shared by two objects.
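To make the sampling model concrete, the following is a minimal numerical sketch of the collapsed log-likelihood in Equation (2); it is not taken from the paper's code repository, the function name and the use of numpy are our own choices, and $\sigma_a$ and $\sigma$ are kept as arguments even though the theoretical results fix both at 1.

```python
import numpy as np

def log_likelihood(X, Z, sigma_a=1.0, sigma=1.0):
    """Collapsed log p(X | Z) of Equation (2); X is n x p, Z is p x K binary."""
    n, p = X.shape
    cov = sigma_a**2 * (Z @ Z.T) + sigma**2 * np.eye(p)   # covariance of each x_i^T
    sign, logdet = np.linalg.slogdet(cov)
    prec = np.linalg.inv(cov)
    quad = np.einsum('ij,jk,ik->', X, prec, X)            # sum_i x_i cov^{-1} x_i^T
    return -0.5 * (n * p * np.log(2.0 * np.pi) + n * logdet + quad)
```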

2.3. Prior distributions based on IBP and pIBP

In the latent feature model (Equation (1)), it remains to specify the prior for the binary feature matrix $Z$. The IBP and pIBP are popular prior choices on binary matrices with an unbounded number of columns. The IBP assumes exchangeability among the objects, while the pIBP introduces dependency among the entries of the $k$th column of $Z$ through a rooted tree $T$. See Figure 1 for an example of the tree. The IBP is a special case of the pIBP in which the root node is the only internal node of the tree. The construction and the pmf of the IBP are described and derived in detail in Griffiths and Ghahramani (2011). For the pIBP, only a brief definition is given in Miller et al. (2008). For the proof of the main theoretical result of this paper, we propose a construction of the pIBP in a similar way as the IBP and derive the pmf of the pIBP.

Figure 1. An example of a tree structure $T$, which is a directed graph with random variables at the nodes (marked as circles). Entries of the $k$th column of $Z$, the $z_{jk}$'s, are at the leaves. The lengths of all edges of $T$, the $t_i$'s and $\eta_l$'s, are marked on the figure. In particular, the $\eta_l$'s represent the lengths between each leaf ($z_{jk}$, shaded nodes) and its parent node ($\zeta_l$, dotted nodes). The total edge length $S(T)$ is the sum of the lengths of all edges of $T$. In this example, $S(T)=\sum_{1\le i\le 6} t_i + (2\eta_1+\eta_2+3\eta_3+\eta_4+4\eta_5+\eta_6)$. The condition in case (2) of Lemma 3.3 in Section 3 means $\inf_{1\le l\le 6}\eta_l \ge \eta_0$ for some $\eta_0>0$.

We first introduce some notation. Let $\mathcal{Z}_{p\times K}$ denote the collection of binary matrices with $p$ rows and $K$ ($K\in\mathbb{N}^+$) columns such that none of the columns consists of all 0's. Let $\mathcal{Z}_p=\bigcup_{K=1}^{\infty}\mathcal{Z}_{p\times K}$, and let $\mathbf{0}$ denote a $p$-dimensional vector whose elements are all 0's, i.e., $\mathbf{0}=(0,0,\ldots,0)^T$. In the following sections, we also regard $\mathbf{0}$ as a $p\times 1$ matrix when needed. Both the IBP and pIBP are defined over $\mathcal{Z}_p^0 \equiv \{\mathbf{0}\}\cup\mathcal{Z}_p$. It can be shown that, with probability 1, a draw from the IBP or pIBP has only finitely many columns. For the construction of the IBP and pIBP, we introduce some more notation as follows. Denote by $\widetilde{\mathcal{Z}}_{p\times \widetilde{K}}$ the collection of all binary matrices with $p$ rows and $\widetilde{K}$ columns (where the columns can be all zeros). We define a many-to-one mapping $G(\cdot): \widetilde{\mathcal{Z}}_{p\times \widetilde{K}} \to \mathcal{Z}_p^0$. For a binary matrix $\widetilde{Z}\in\widetilde{\mathcal{Z}}_{p\times \widetilde{K}}$, if all columns of $\widetilde{Z}$ are 0's, then $G(\widetilde{Z})=\mathbf{0}$; otherwise, $G(\widetilde{Z})$ is obtained by deleting all zero columns of $\widetilde{Z}$. For the purpose of performing inference on the similarity matrix $ZZ^T$, it suffices to focus on the set of equivalence classes induced by $G(\cdot)$. Two matrices $\widetilde{Z}_1,\widetilde{Z}_2\in\widetilde{\mathcal{Z}}_{p\times \widetilde{K}}$ are $G$-equivalent if $G(\widetilde{Z}_1)=G(\widetilde{Z}_2)$, and in this case the similarity matrices induced by $\widetilde{Z}_1$ and $\widetilde{Z}_2$ are the same, $\widetilde{Z}_1\widetilde{Z}_1^T=\widetilde{Z}_2\widetilde{Z}_2^T$.
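As a small illustration of $G(\cdot)$ and $G$-equivalence (our own toy example, not part of the paper), the snippet below deletes all-zero columns and verifies that two $G$-equivalent matrices induce the same similarity matrix.

```python
import numpy as np

def G(Z_tilde):
    """The mapping G(.): delete all-zero columns of a binary matrix."""
    keep = Z_tilde.sum(axis=0) > 0             # columns that are not all zeros
    return Z_tilde[:, keep]                    # an empty matrix plays the role of 0

Z1 = np.array([[1, 0, 0],
               [0, 0, 1],
               [1, 0, 1]])                     # one all-zero column in the middle
Z2 = np.array([[1, 0, 0, 0],
               [0, 1, 0, 0],
               [1, 1, 0, 0]])                  # same nonzero columns, extra zero columns
assert np.array_equal(G(Z1), G(Z2))            # G-equivalent
assert np.array_equal(G(Z1) @ G(Z1).T, Z2 @ Z2.T)   # identical similarity matrices
```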

We now turn to a constructive definition of pIBP. The pIBP can be constructed in the following three steps by taking the limit of a finite feature model.

Step 1. Given a hyperparameter $\alpha$ and a tree structure $T$, we start by defining a probability distribution $P_{\widetilde{K}}(\widetilde{Z} \mid \alpha, T)$ over $\widetilde{Z}\in\widetilde{\mathcal{Z}}_{p\times \widetilde{K}}$. Write $\widetilde{Z}=(\widetilde{z}_1,\widetilde{z}_2,\ldots,\widetilde{z}_{\widetilde{K}})$. Let $\pi=\{\pi_1,\pi_2,\ldots,\pi_{\widetilde{K}}\}$ be a vector of success probabilities such that $(\pi_k \mid \alpha) \stackrel{\text{i.i.d.}}{\sim} \text{Beta}(\alpha/\widetilde{K},1)$. The columns of $\widetilde{Z}$ are conditionally independent given $\pi$ and $T$, $P_{\widetilde{K}}(\widetilde{Z} \mid \pi, T)=\prod_{k=1}^{\widetilde{K}} P(\widetilde{z}_k \mid \pi_k, T)$, where $P(\widetilde{z}_k \mid \pi_k, T)$ is determined as in Miller et al. (2008) (more details in Supplementary Section S.1). If $P(\widetilde{z}_k \mid \pi_k, T)=\prod_{j=1}^p P(\widetilde{z}_{jk} \mid \pi_k)$ with $\widetilde{z}_{jk} \stackrel{\text{i.i.d.}}{\sim} \text{Bernoulli}(\pi_k)$, the pIBP reduces to the IBP. The marginal probability of $\widetilde{Z}$ is $P_{\widetilde{K}}(\widetilde{Z} \mid \alpha, T)=\prod_{k=1}^{\widetilde{K}} \int P(\widetilde{z}_k \mid \pi_k, T)\, p(\pi_k \mid \alpha)\, d\pi_k$.

Step 2. Next, for any $Z\in\mathcal{Z}_p^0$ with $K$ columns, we define a probability distribution (for $\widetilde{K}\ge K$) $\Pi_{\widetilde{K}}(Z \mid \alpha, T) \equiv \sum_{\widetilde{Z}\in\widetilde{\mathcal{Z}}_{p\times\widetilde{K}}:\, G(\widetilde{Z})=Z} P_{\widetilde{K}}(\widetilde{Z} \mid \alpha, T)$, for $P_{\widetilde{K}}(\widetilde{Z} \mid \alpha, T)$ defined in Step 1. That is, we collapse all binary matrices in $\widetilde{\mathcal{Z}}_{p\times\widetilde{K}}$ that are $G$-equivalent.

Step 3. Finally, for any $Z\in\mathcal{Z}_p^0$, define $\Pi(Z \mid \alpha, T) \equiv \lim_{\widetilde{K}\to\infty} \Pi_{\widetilde{K}}(Z \mid \alpha, T)$. Here $\Pi(Z \mid \alpha, T)$ is the pmf of the pIBP under $G$-equivalence classes.

Based on the three steps of constructing the pIBP, we derive the pmf of the pIBP given $\alpha$ and $T$. Details of the derivation are given in Supplementary Section S.1. Let $S(T)$ denote the total edge length of the tree structure (see Figure 1) and $\psi(\cdot)$ denote the digamma function. For $Z\in\mathcal{Z}_p^0$, we have
$$\Pi(Z \mid \alpha, T)=\begin{cases} \exp\{-(\psi(S(T)+1)-\psi(1))\,\alpha\}, & \text{if } Z=\mathbf{0};\\ \exp\{-(\psi(S(T)+1)-\psi(1))\,\alpha\} \times \dfrac{\alpha^K}{K!}\prod_{k=1}^K \lambda_k, & \text{if } Z\in\mathcal{Z}_{p\times K}, \end{cases} \qquad (3)$$
where $\lambda_k \equiv \lambda(z_k, T)=\int_0^1 P(z_k \mid \pi_k, T)\, \pi_k^{-1}\, d\pi_k$ (see Supplementary Section S.1) and $z_k$ is the $k$th column of $Z$.

Assume $\alpha\sim\text{Gamma}(1,1)$. After integrating out $\alpha$, we obtain the pmf of the pIBP mixture,
$$\Pi(Z \mid T)=\begin{cases} (\psi(S(T)+1)-\psi(1)+1)^{-1}, & \text{if } Z=\mathbf{0};\\ (\psi(S(T)+1)-\psi(1)+1)^{-(K+1)} \times \prod_{k=1}^K \lambda_k, & \text{if } Z\in\mathcal{Z}_{p\times K}. \end{cases}$$
For notational simplicity, we suppress the conditioning on $T$ hereafter when we discuss the pIBP.

When $S(T)=p$ and $\lambda_k=\int_0^1 \pi_k^{m_k-1}(1-\pi_k)^{p-m_k}\, d\pi_k$, the pIBP reduces to the IBP, where $m_k=\sum_{j=1}^p z_{jk}$ denotes the number of objects possessing feature $k$. The pmf of the IBP mixture is
$$\Pi(Z)=\begin{cases} (H_p+1)^{-1}, & \text{if } Z=\mathbf{0};\\ (H_p+1)^{-(K+1)} \prod_{k=1}^K \dfrac{(p-m_k)!\,(m_k-1)!}{p!}, & \text{if } Z\in\mathcal{Z}_{p\times K}, \end{cases}$$
where $H_p=\sum_{j=1}^p (1/j)$.
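The IBP-mixture pmf above can be evaluated directly; the following sketch (ours, not part of the paper) computes its logarithm with log-gamma functions for numerical stability, assuming $Z$ has no all-zero columns.

```python
import numpy as np
from scipy.special import gammaln

def log_ibp_mixture_pmf(Z):
    """log Pi(Z) for the IBP mixture; Z is p x K binary with no all-zero columns."""
    p, K = Z.shape
    H_p = np.sum(1.0 / np.arange(1, p + 1))               # harmonic number H_p
    if K == 0:                                            # the Z = 0 case
        return -np.log(H_p + 1.0)
    m = Z.sum(axis=0)                                     # m_k: objects possessing feature k
    # log[(p - m_k)! (m_k - 1)! / p!] via log-gamma
    log_lambda = gammaln(p - m + 1) + gammaln(m) - gammaln(p + 1)
    return -(K + 1) * np.log(H_p + 1.0) + log_lambda.sum()
```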

3. Posterior contraction rate under the sparsity condition

In this section, we establish the posterior contraction rate of IBP mixture and pIBP mixture under a sparsity condition. All the proofs are given in the supplement.

The sparsity condition is defined below for a sequence of binary matrices $\{Z_n, n=1,2,\ldots\}$.

Definition 3.1

Sparsity

Consider a sequence of binary matrices $\{Z_n, n=1,2,\ldots\}$, where $Z_n=(z_{jk})_{p_n\times K_n}$. Assume that $m_{kn}\equiv\sum_{j=1}^{p_n} z_{jk} \le s_n$ for some $s_n\ge 1$ and all $k=1,\ldots,K_n$. We say $\{Z_n, n=1,2,\ldots\}$ are sparse if $s_n/p_n\to 0$ as $n\to\infty$.

The condition indicates that, as the sample size increases, the number of objects possessing any single feature is upper bounded and must be relatively small compared to the total number of objects. Such an assumption may be assessed via simulation studies and then applied to real-world applications; examples are provided later on. Similar but stricter assumptions are made in Castillo and van der Vaart (2012) and Pati et al. (2014), under different contexts.
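For concreteness, here is a small sketch (ours) of the quantities in Definition 3.1 for a single binary matrix; formally the definition concerns a sequence of matrices, so the ratio $s/p$ computed here is the finite-sample analogue of $s_n/p_n$.

```python
import numpy as np

def sparsity_summary(Z):
    """s = max_k m_k and s/p for one binary matrix (cf. Definition 3.1)."""
    p, K = Z.shape
    m = Z.sum(axis=0)                     # m_k: number of objects possessing feature k
    s = int(m.max()) if K > 0 else 1      # the bound s_n (convention s_n = 1 for Z = 0)
    return s, s / p
```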

Next, we review the definition of posterior contraction rate.

Definition 3.2

Posterior Contraction Rate

Let $\{Z_n^*, n=1,2,\ldots\}$ represent a sequence of true latent feature matrices, where each $Z_n^*$ has $p_n$ rows. For each $n$, the observations are generated from $x_i^T \stackrel{\text{i.i.d.}}{\sim} N(0,\, Z_n^* Z_n^{*T}+I_{p_n})$ for $i=1,2,\ldots,n$. Denote by $\Pi(\cdot \mid X)$ the posterior distribution of the latent feature matrix under the IBP mixture or pIBP mixture prior. If $E_{Z_n^*}\big[\Pi\big(\|Z_n Z_n^T - Z_n^* Z_n^{*T}\| \le C\epsilon_n \mid X\big)\big]\to 1$ as $n\to\infty$, where $\|\cdot\|$ represents the spectral norm and $C$ is a positive constant, then we say the posterior contraction rate of $Z_n Z_n^T$ to the true $Z_n^* Z_n^{*T}$ is $\epsilon_n$ under the spectral norm.

For the proof of the main theorem, we derive the following lemma. The lemma establishes a lower bound on $\Pi(Z_n=Z_n^*)$ for a sequence of binary matrices $\{Z_n^*, n=1,2,\ldots\}$ satisfying the sparsity condition, where $\Pi(\cdot)$ represents the pmf of the IBP mixture or pIBP mixture.

Lemma 3.3

Consider a sequence of binary matrices $\{Z_n^*: Z_n^*\in\{\mathbf{0}\}\cup\mathcal{Z}_{p_n\times K_n},\, n=1,2,\ldots\}$ that are sparse under Definition 3.1. Parameters $m_{kn}$ and $s_n$ are defined accordingly (for $\mathbf{0}$ matrices, let $s_n=K_n=1$). We have $\Pi(Z_n=Z_n^*)\ge \exp(-C s_n K_n \log(p_n+1))$ for some positive constant $C$, if either of the following two cases is true: (1) $Z_n$ follows the IBP mixture; (2) $Z_n$ follows the pIBP mixture, and the minimal length between each leaf and its parent node is lower bounded by $\eta_0>0$ (see Figure 1).

Remark 3.1

Results in Lemma 3.3 depend on the sparsity condition. As a counterexample, we approximate $\Pi(Z_n=Z_n^*)$ for a sequence of non-sparse binary matrices $\{Z_n^*, n=1,2,\ldots\}$, where $Z_n^*\in\mathcal{Z}_{p_n}$ and $Z_n$ follows the IBP mixture. Recall that for the IBP mixture, we have $\Pi(Z_n=Z_n^*)=(H_{p_n}+1)^{-(K_n+1)}\prod_{k=1}^{K_n} \frac{(p_n-m_{kn})!\,(m_{kn}-1)!}{p_n!}$. Let $m_{kn}=p_n/2$ for every column $k=1,\ldots,K_n$; then $[(p_n-m_{kn})!\,(m_{kn}-1)!]/p_n! = [(p_n/2)!\,(p_n/2-1)!]/p_n!$, and Stirling's formula implies that $[(p_n/2)!\,(p_n/2-1)!]/p_n! \approx \sqrt{2\pi/p_n}\; 2^{-p_n}$. Since $H_{p_n}=\log p_n + O(1)$, $\Pi(Z_n=Z_n^*)\approx \exp(-C K_n \log\log(p_n+2)) \times \big(C\sqrt{2\pi/p_n}\; 2^{-p_n}\big)^{K_n} \le \exp(-C K_n p_n)$. Comparing the results obtained with and without the sparsity condition, we find that the lower bound with the sparsity condition is very likely to be larger than the upper bound without the sparsity condition. To consider a very extreme case, when $s_n=\log(p_n+1)$, the lower bound $\exp(-C K_n (\log(p_n+1))^2)$ is much larger than $\exp(-C K_n p_n)$.

We present the main theorem of this paper, which shows that for a sequence of true latent feature matrices $Z_n^*$ that satisfy the sparsity condition in Definition 3.1, the posterior distribution of the similarity matrix $Z_n Z_n^T$ converges to $Z_n^* Z_n^{*T}$. The theorem eventually leads to the main theoretical result in Remark 3.2 later.

Theorem 3.4

Consider a sequence of sparse binary matrices $\{Z_n^*, n=1,2,\ldots\}$ as in Definition 3.1 and the prior in either of the two cases of Lemma 3.3. For each $n$, suppose the observations are generated from $x_i^T \stackrel{\text{i.i.d.}}{\sim} N(0,\, Z_n^* Z_n^{*T}+I_{p_n})$ for $i=1,2,\ldots,n$. Let
$$\epsilon_n=\sqrt{\frac{\max\{p_n,\; s_n K_n \log(p_n+1)\}}{n}} \times \max\{1,\; \|Z_n^* Z_n^{*T}\|\}.$$
If $\epsilon_n\to 0$ as $n\to\infty$, then we have $E_{Z_n^*}\big[\Pi\big(\|Z_n Z_n^T - Z_n^* Z_n^{*T}\| \le C\epsilon_n \mid X\big)\big]\to 1$ as $n\to\infty$ for some positive constant $C$. In other words, $\epsilon_n$ is the posterior contraction rate under the spectral norm.

For the sequence of sparse binary matrices $\{Z_n^*, n=1,2,\ldots\}$ considered in Theorem 3.4, if $\|Z_n^* Z_n^{*T}\|\le M_n$ for some $M_n\ge 1$ and $\tilde{\epsilon}_n \equiv \sqrt{\max\{p_n,\; s_n K_n \log(p_n+1)\}/n}\; M_n \to 0$ as $n\to\infty$, then $\tilde{\epsilon}_n$ is a valid posterior contraction rate. We show in the following corollary that if the number of features possessed by each object is upper bounded, then a new posterior contraction rate with a simpler expression (compared to Theorem 3.4) can be derived.

Corollary 3.5

Consider a sequence of sparse binary matrices $\{Z_n^*, n=1,2,\ldots\}$ as in Definition 3.1, where $Z_n^*=(z_{jk})_{p_n\times K_n}$. Suppose that there exists $q_n\ge 1$ such that $\sup_{1\le j\le p_n}\big(\sum_{k=1}^{K_n} z_{jk}\big)\le q_n$, i.e., the number of features possessed by each object (non-zero entries of each row of $Z_n^*$) is upper bounded by $q_n$. Given the same assumptions as in Theorem 3.4, except for replacing $\epsilon_n\to 0$ by $\tilde{\epsilon}_n \equiv \sqrt{\max\{p_n,\; s_n K_n \log(p_n+1)\}/n}\; s_n q_n \to 0$ as $n\to\infty$, we have $E_{Z_n^*}\big[\Pi\big(\|Z_n Z_n^T - Z_n^* Z_n^{*T}\| \le C\tilde{\epsilon}_n \mid X\big)\big]\to 1$ as $n\to\infty$ for some positive constant $C$. Specifically,

(1)

if there is no additional condition on $q_n$, $\tilde{\epsilon}_n=\sqrt{\max\{p_n,\; s_n K_n \log(p_n+1)\}/n}\; s_n K_n$;

(2)

if $q_n$ is a constant, $\tilde{\epsilon}_n=\sqrt{\max\{p_n,\; s_n K_n \log(p_n+1)\}/n}\; s_n$.

Remark 3.2

Consider the second case of Corollary 3.5. If (1) $s_n K_n \log(p_n+1)=O(p_n)$ and (2) $s_n=O(\log(p_n+1))$, then $\tilde{\epsilon}_n\le C\sqrt{p_n/n}\,\log(p_n+1)$; therefore, if $p_n(\log(p_n+1))^2=o(n)$, then $\sqrt{p_n/n}\,\log(p_n+1)$ is a valid posterior contraction rate. In other words, to ensure posterior convergence, we only need $n$ to increase a little faster than $p_n$, given assumptions (1) and (2) and the condition in the second case of Corollary 3.5.
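The rate in Theorem 3.4 is easy to evaluate for a given true matrix; the sketch below (ours) does so, ignoring the constant $C$. The square-root form of the rate is our reading of the theorem, reconstructed to be consistent with Remark 3.2.

```python
import numpy as np

def contraction_rate(Z_star, n):
    """eps_n of Theorem 3.4 for a true matrix Z* (p x K) and sample size n."""
    p, K = Z_star.shape
    s = Z_star.sum(axis=0).max()                          # s_n in Definition 3.1
    spec = np.linalg.norm(Z_star @ Z_star.T, ord=2)       # ||Z* Z*^T||, spectral norm
    # Corollary 3.5 replaces max{1, ||Z* Z*^T||} by s_n q_n; here we use the
    # Theorem 3.4 form directly.
    return np.sqrt(max(p, s * K * np.log(p + 1)) / n) * max(1.0, spec)
```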

4. Posterior inference based on MCMC

We have specified the hierarchical model, including the sampling model $p(X\mid Z)$ and the prior models for the parameters, $\Pi(Z\mid\alpha)$ and $p(\alpha)$. In particular, $p(X\mid Z)$ is the latent feature model (Equation (2)), $\Pi(Z\mid\alpha)$ is the IBP or pIBP prior (Equation (3)), and $p(\alpha)=\text{Gamma}(1,1)$. For the theoretical results in Section 3, we integrate out $\alpha$. For posterior inference, we keep $\alpha$ so that the conditional distributions can be obtained in closed form.

We use Markov chain Monte Carlo simulations to generate samples from the posterior $(Z,\alpha)\sim\Pi(Z,\alpha\mid X)\propto p(X\mid Z)\,\Pi(Z\mid\alpha)\,p(\alpha)$. After iterating sufficiently many steps, the samples of $Z$ drawn from the Markov chain approximately follow $\Pi(Z\mid X)$. Gibbs sampling transition probabilities can be used to update $Z$ and $\alpha$, as described in Griffiths and Ghahramani (2011) and Miller et al. (2008).

To overcome trapping of the Markov chain in local modes in the high-dimensional setting, we use parallel tempering Markov chain Monte Carlo (PTMCMC) (Geyer, 1991), in which several Markov chains at different temperatures run in parallel and exchange states with one another. In particular, the target distribution of the Markov chain indexed by temperature $T$ is $\Pi^{(T)}(Z,\alpha\mid X)\propto p(X\mid Z)^{1/T}\,\Pi(Z\mid\alpha)\,p(\alpha)$. Parallel tempering helps the original Markov chain (the chain whose temperature is $T=1$) avoid getting stuck in local modes and approximate the target distribution efficiently.

We give an algorithm below for sampling from $\Pi(Z,\alpha\mid X)$, where $\Pi(Z\mid\alpha)$ follows the IBP. The algorithm describes in detail how PTMCMC can be combined with the Gibbs sampler in Griffiths and Ghahramani (2011). The algorithm iterates Steps 1 and 2 in turn.

Step 1 (Updating $Z$ and $\alpha$). Denote by $z_{jk}$ the entries of $Z$, $j=1,\ldots,p$ and $k=1,\ldots,K$. We update $Z$ by row. For each row $j$, we iterate through the columns $k=1,\ldots,K$. We first make a decision to drop the $k$th column of $Z$, $z_k$, if and only if $m_{(-j)k}=\sum_{j'\ne j} z_{j'k}=0$. In other words, if feature $k$ is not possessed by any object other than $j$, then the $k$th column of $Z$ should be dropped, regardless of whether $z_{jk}=0$ or 1.

If the $k$th column is not dropped, we sample $z_{jk}$ from $\Pi^{(T)}(z_{jk}\mid\cdot)\propto p(X\mid Z)^{1/T}\,\Pi(z_{jk}\mid Z_{-(jk)},\alpha)$, where $Z_{-(jk)}$ represents all entries of $Z$ except $z_{jk}$, and $p(X\mid Z)$ is determined by $x_i^T\mid Z \stackrel{\text{i.i.d.}}{\sim} N(0,\, ZZ^T+I_p)$, in which the $x_i$'s are rows of $X$. The conditional prior $\Pi(z_{jk}\mid Z_{-(jk)},\alpha)$ only depends on $z_{(-j)k}$ (i.e., the $k$th column of $Z$ excluding $z_{jk}$). Specifically, $\Pi(z_{jk}=1\mid Z_{-(jk)},\alpha)=m_{(-j)k}/p$. After updating all entries in the $j$th row, we add a random number $K_j^+$ of columns (features) to $Z$. The $K_j^+$ new features are only possessed by object $j$, i.e., only the $j$th entry is 1 while all other entries are 0. Let $Z^+$ denote the feature matrix after $K_j^+$ columns are added to the old feature matrix. The conditional posterior distribution of $K_j^+$ is $P^{(T)}(K_j^+\mid\cdot)\propto p(X\mid Z^+)^{1/T}\,P(K_j^+\mid\alpha)$, in which $P(K_j^+\mid\alpha)$ is the prior distribution of $K_j^+$ under the IBP, $P(K_j^+\mid\alpha)=\text{Pois}(\alpha/p)$. The support of $P^{(T)}(K_j^+\mid\cdot)$ is $\mathbb{N}$. For easier evaluation of $P^{(T)}(K_j^+\mid\cdot)$, we work with an approximation by truncating $P^{(T)}(K_j^+\mid\cdot)$ at level $K_{\max}^+$, similar to the idea of truncating a stick-breaking prior in Ishwaran and James (2001). The value $K_{\max}^+$ is the maximum number of new columns (features) that can be added to $Z$ each time we update the $j$th row. Denoting by $\widetilde{P}^{(T)}(K_j^+\mid\cdot)$ the truncated conditional posterior, we have
$$\widetilde{P}^{(T)}(K_j^+=k\mid\cdot)=\frac{P^{(T)}(k\mid\cdot)}{\sum_{k'=0}^{K_{\max}^+} P^{(T)}(k'\mid\cdot)}, \quad \text{for } k=0,1,\ldots,K_{\max}^+. \qquad (4)$$
Lastly, we update $\alpha$. Given $Z$, the observed data $X$ and $\alpha$ are conditionally independent, which implies that the conditional posterior distribution of $\alpha$ at any temperature $T$ is the same. We sample $\alpha$ from $(\alpha\mid\cdot)\sim\text{Gamma}\big(K+1,\, (\sum_{j=1}^p 1/j + 1)^{-1}\big)$.
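The following sketch (ours, not the authors' implementation) puts Step 1 together for a single row $j$ at temperature $T$; it reuses the log_likelihood helper from the earlier sketch and recomputes the full likelihood at each evaluation for clarity, whereas a practical implementation would use low-rank updates.

```python
import math
import numpy as np

def update_row(X, Z, j, alpha, T=1.0, K_max_plus=10, rng=None):
    """One pass of Step 1 for row j at temperature T (sketch); returns updated Z."""
    rng = np.random.default_rng() if rng is None else rng
    p = Z.shape[0]
    # --- update the existing entries z_jk ---
    k = 0
    while k < Z.shape[1]:
        m_minus = Z[:, k].sum() - Z[j, k]            # m_(-j)k
        if m_minus == 0:                             # feature owned only by object j: drop it
            Z = np.delete(Z, k, axis=1)
            continue
        logw = np.empty(2)
        for a in (0, 1):
            Z[j, k] = a
            prior = m_minus / p if a == 1 else 1.0 - m_minus / p
            logw[a] = log_likelihood(X, Z) / T + np.log(prior)
        Z[j, k] = rng.binomial(1, 1.0 / (1.0 + np.exp(logw[0] - logw[1])))
        k += 1
    # --- sample K_j^+ new singleton features, truncated at K_max_plus (Equation (4)) ---
    logw = np.empty(K_max_plus + 1)
    for m in range(K_max_plus + 1):
        new_cols = np.zeros((p, m), dtype=int); new_cols[j, :] = 1
        # Poisson(alpha / p) prior on the number of new features
        logw[m] = (log_likelihood(X, np.hstack([Z, new_cols])) / T
                   + m * np.log(alpha / p) - alpha / p - math.lgamma(m + 1))
    w = np.exp(logw - logw.max()); w /= w.sum()
    m = rng.choice(K_max_plus + 1, p=w)
    new_cols = np.zeros((p, m), dtype=int); new_cols[j, :] = 1
    return np.hstack([Z, new_cols])

def update_alpha(Z, rng=None):
    """Refresh alpha from its conjugate conditional Gamma(K + 1, 1 / (H_p + 1))."""
    rng = np.random.default_rng() if rng is None else rng
    p, K = Z.shape
    H_p = np.sum(1.0 / np.arange(1, p + 1))
    return rng.gamma(shape=K + 1, scale=1.0 / (H_p + 1.0))
```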

Step 2 (Interchanging States across Parallel Chains). We sort the Markov chains in descending order of their temperatures $T$. The next step of PTMCMC is interchanging states between adjacent Markov chains. Let $(Z^{(T)},\alpha^{(T)})$ denote the state of the Markov chain indexed by temperature $T$. Suppose we run $N$ parallel chains with descending temperatures $T_N, T_{N-1},\ldots,T_2,T_1$, where $T_1=1$. Sequentially, for each $i=N,N-1,\ldots,2$, we propose an interchange of states between $(Z^{(T_i)},\alpha^{(T_i)})$ and $(Z^{(T_{i-1})},\alpha^{(T_{i-1})})$ and accept the proposal with probability
$$A_{T_i,T_{i-1}}=\min\left\{\left(\frac{p(X\mid Z^{(T_{i-1})})}{p(X\mid Z^{(T_i)})}\right)^{\frac{1}{T_i}-\frac{1}{T_{i-1}}},\; 1\right\}.$$
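A sketch (ours) of the state interchange in Step 2, again reusing the log_likelihood helper; the geometric temperature ladder shown at the end matches the settings used in the simulations of Section 5.

```python
import numpy as np

def propose_swaps(X, states, temps, rng=None):
    """One round of Step 2: propose swaps between adjacent chains.

    temps are in descending order with temps[-1] = 1; states[i] = (Z, alpha)
    of the chain at temperature temps[i]."""
    rng = np.random.default_rng() if rng is None else rng
    for i in range(len(temps) - 1):                      # hottest pair first
        T_hot, T_cold = temps[i], temps[i + 1]
        ll_hot = log_likelihood(X, states[i][0])
        ll_cold = log_likelihood(X, states[i + 1][0])
        # A = min{ (p(X|Z_cold) / p(X|Z_hot))^(1/T_hot - 1/T_cold), 1 }
        log_A = (1.0 / T_hot - 1.0 / T_cold) * (ll_cold - ll_hot)
        if np.log(rng.uniform()) < min(log_A, 0.0):
            states[i], states[i + 1] = states[i + 1], states[i]
    return states

# Geometric temperature ladder used in Section 5: N = 11 chains, beta = 1.2
temps = [1.2 ** k for k in range(10, -1, -1)]            # T_11 = 1.2^10, ..., T_1 = 1
```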

5. Examples

5.1. Simulation studies

We conduct simulations to examine the convergence of the posterior distribution of $Z_n Z_n^T$ (under the IBP mixture prior) to the true similarity matrix $Z_n^* Z_n^{*T}$. We consider several true similarity matrices with different sample sizes $n$ and explore the contraction of the posterior distribution. Hereinafter, we suppress the index $n$.

5.1.1. Simulations under the sparsity condition of Z

In this simulation, the true latent feature matrix, $Z^*=(z_{jk}^*)_{p\times K^*}$, is randomly generated under the sparsity condition in Definition 3.1 (i.e., $m_k\le s$ for $k=1,\ldots,K^*$, where $s$ is relatively small compared with $p$). We set $s=10$, $K^*=10$ and $p=50$, 100, or 150. We use $s/p$ to measure the sparsity (as in Definition 3.1).

Once $Z^*$ is generated, the samples $x_1,x_2,\ldots,x_n$ are generated from the sampling model (Equation (2)) with $x_i^T \stackrel{\text{i.i.d.}}{\sim} N(0,\, Z^* Z^{*T}+I_p)$, where the $x_i$'s are rows of $X$. For each value of $p$, we conduct 4 simulations, with sample sizes $n=20$, 50, 80 or 100.

Given simulated data $X$, we use the PTMCMC algorithm proposed in Section 4 to sample from the posterior distribution $\Pi(Z\mid X)$, setting $K_{\max}^+$ in Equation (4) to 10. We set the number of parallel Markov chains to $N=11$ with geometrically spaced temperatures; namely, the ratio between adjacent temperatures is $\beta\equiv T_{i+1}/T_i=1.2$, with $T_1=1$. We run 1000 MCMC iterations. The chains converge quickly and mix well based on basic MCMC diagnostics.

We repeat the simulation scheme 40 times, each time generating a new data set with a new random seed and applying the PTMCMC algorithm for inference. We report several summaries in Table 1 based on the 40 simulation studies. The entries are the true values of $(n,p)$ (first column), $p/n$ (second column), the average (across the 40 simulation studies) of the 1000th MCMC value of $K$ (third column), the average residual $\|ZZ^T-Z^*Z^{*T}\|$ based on the 1000th MCMC value of $Z$ (fourth column), the average $\epsilon$ value as defined in Theorem 3.4 (fifth column), and the average $\|ZZ^T-Z^*Z^{*T}\|/\epsilon$ value (sixth column).

Table 1. Simulation results for different combinations of n and p.

For fixed $p$, $ZZ^T$ converges to $Z^*Z^{*T}$ as the sample size increases. When $n=100$, $\|ZZ^T-Z^*Z^{*T}\|$ is very close to 0 and $K$ is close to the truth $K^*=10$. On the other hand, for fixed $n$, increasing $p$ does not make the posterior distribution of $ZZ^T$ significantly less concentrated at the true $Z^*Z^{*T}$, implying that the inference is robust to a high-dimensional matrix $Z$ as long as the true matrix $Z^*$ is sparse. The value of $\|ZZ^T-Z^*Z^{*T}\|/\epsilon$ demonstrates the convergence result in Theorem 3.4. As shown in Figure 2, as the sample size increases, $\|ZZ^T-Z^*Z^{*T}\|/\epsilon$ converges to 0. This verifies the theoretical results reported earlier. Note that, similar to Chen et al. (2016), we use $\|ZZ^T-Z^*Z^{*T}\|/\epsilon$ instead of the probability $\Pi(\|ZZ^T-Z^*Z^{*T}\|\le C\epsilon\mid X)$ to demonstrate the theoretical results, due to the arbitrariness of the constant term $C$. With a finite number of simulation experiments, one can always find a constant $C$ such that $\Pi(\|ZZ^T-Z^*Z^{*T}\|\le C\epsilon\mid X)$ is always 1.
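The summary plotted in Figure 2 can be computed as follows (our sketch, reusing contraction_rate from the earlier sketch).

```python
import numpy as np

def residual_ratio(Z_draw, Z_star, n):
    """Spectral-norm residual ||Z Z^T - Z* Z*^T|| for a posterior draw Z,
    and its ratio to the contraction rate eps_n."""
    resid = np.linalg.norm(Z_draw @ Z_draw.T - Z_star @ Z_star.T, ord=2)
    return resid, resid / contraction_rate(Z_star, n)
```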

Figure 2. Simulation results for different combinations of $n$ and $p$: plot of $\|Z_nZ_n^T-Z_n^*Z_n^{*T}\|/\epsilon_n$ versus $n$, where $\|Z_nZ_n^T-Z_n^*Z_n^{*T}\|$ is the spectral norm of the residual of the similarity matrix and $\epsilon_n$ is the posterior contraction rate defined in Theorem 3.4. The ratio converges to zero as $n$ increases, demonstrating the theoretical results. The vertical error bars represent one standard error.

5.1.2. Sensitivity of sparsity

In this part, we use simulation results to demonstrate the effect of the sparsity of $Z^*$ on posterior convergence. We set $p=50$, $K^*=10$ and $n=100$, and use different values $s=10, 15, 18, 20, 23, 25$, gradually reducing the sparsity of $Z^*$. We generate $Z^*$ and $X$ in the same way as in the previous simulation. We set the number of parallel Markov chains to $N=11$ and $K_{\max}^+=10$. To increase the frequency of interchanges between adjacent chains, we reduce $\beta$ to 1.15.

Table 2 reports the simulation results. As $Z^*$ becomes less sparse, the posterior distribution of $Z$ becomes less concentrated on $Z^*$, in terms of both $K$ and $\|ZZ^T-Z^*Z^{*T}\|$. Specifically, for $s\in\{15,20,25\}$, each time $s$ increases by 5, $\|ZZ^T-Z^*Z^{*T}\|$ inflates more than 10-fold.

Table 2. Simulation results for different s.

5.2. Application for proteomics

We apply the binary latent feature model to the analysis of a reverse phase protein arrays (RPPA) dataset from The Cancer Genome Atlas (TCGA, https://tcga-data.nci.nih.gov/tcga/), downloaded with TCGA Assembler 2 (Wei et al., 2017). The RPPA dataset records levels of protein expression based on incubating a matrix of biological samples on a microarray with specific antibodies that target corresponding proteins (Sheehan et al., 2005; Spurrier et al., 2008). We focus on patients categorized into 5 different cancer types: breast cancer (BRCA), diffuse large B-cell lymphoma (DLBC), glioblastoma multiforme (GBM), clear cell kidney carcinoma (KIRC) and lung adenocarcinoma (LUAD). Data on $n=157$ proteins from $p=100$ patients are available, with an equal number of 20 patients for each cancer type. Note that we consider proteins as experimental units and patients as objects in the data matrix, with the aim of allocating latent features to patients (not proteins). This will become clear when we report the inference results. The rows of the data matrix $X_{n\times p}$ are standardized, with $\sum_{j=1}^p x_{ij}=0$ for $i=1,\ldots,n$. Since the patients are from different cancer types, we expect the data to be highly heterogeneous and the number of common features that two patients share to be small, i.e., the feature matrix is sparse.

We fit the binary latent feature model under the IBP prior $\Pi(Z\mid\alpha)$ with $\alpha\sim\text{Gamma}(1,1)$. In addition, rather than fixing $\sigma^2=\sigma_a^2=1$, we now assume inverse-gamma priors, $\sigma^2\sim\text{IG}(1,1)$ and $\sigma_a^2\sim\text{IG}(1,1)$. We run 1,000 MCMC iterations, as in the simulation studies. We also repeat the MCMC algorithm 3 times with different random seeds and do not observe substantial differences across the runs. We report point estimates $(\hat{Z},\hat{\alpha},\hat{\sigma}^2,\hat{\sigma}_a^2)$ according to the maximum a posteriori (MAP) estimate, $(\hat{Z},\hat{\alpha},\hat{\sigma}^2,\hat{\sigma}_a^2)=\arg\max_{(Z,\alpha,\sigma^2,\sigma_a^2)} p(Z,\alpha,\sigma^2,\sigma_a^2\mid X)$. Figure 3 shows the inferred binary feature matrix $\hat{Z}$, where a shaded gray rectangle indicates that the corresponding patient $j$ possesses feature $k$. For the real data analysis we do not know the true latent feature matrix and its sparsity, but we can use the estimate $\hat{Z}$ to approximate the truth. From the $\hat{Z}$ matrix, we find that a feature is possessed by at most $\hat{s}=31$ patients, and therefore the sparsity of $\hat{Z}$ is about $\hat{s}/p=0.31$. If the estimated sparsity is close to the truth, then according to the simulation results (Table 1 in Section 5.1), the posterior distribution of $Z$ should be highly concentrated at the truth $Z^*$.
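A sketch (ours) of how such a MAP point estimate and the estimated sparsity $\hat{s}/p$ could be read off from MCMC draws; the log_posterior callable stands for the unnormalized joint log density $p(X\mid Z,\sigma^2,\sigma_a^2)$ plus log priors and is assumed, not shown here.

```python
import numpy as np

def map_estimate(samples, X, log_posterior):
    """Pick the MAP draw among MCMC samples and report the estimated sparsity.

    samples: list of dicts with keys 'Z', 'alpha', 'sigma2', 'sigma_a2';
    log_posterior(X, **sample): unnormalized joint log density (assumed)."""
    scores = [log_posterior(X, **s) for s in samples]
    best = samples[int(np.argmax(scores))]
    s_hat = int(best['Z'].sum(axis=0).max())     # most patients sharing any one feature
    return best, s_hat / best['Z'].shape[0]      # (MAP draw, estimated sparsity s_hat / p)
```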

Figure 3. The inferred binary feature matrix $\hat{Z}$ for the TCGA RPPA dataset. The dataset consists of 100 patients, with 20 patients for each of the 5 cancer types: BRCA, DLBC, GBM, KIRC and LUAD. A shaded gray rectangle indicates that the corresponding patient $j$ possesses feature $k$, i.e., the corresponding matrix element $\hat{Z}_{jk}=1$. The columns are in descending order of the number of objects possessing each feature. The rows are reordered for better display.

5.2.1. Biological interpretation of the features

We report the unique genes for the top 10 proteins that have the largest loading values $\hat{a}_{ki}$ for the five most popular features. That is, for the top five features $k$ possessed by the largest numbers of patients (the first five columns in Figure 3), we report the proteins with the largest $\hat{a}_{ki}$ values. The $\hat{a}_{ki}$ values are posterior means computed from the MCMC samples, in which the parameters $a_{ki}$ are sampled from their full conditional distributions. This additional sampling step for the $a_{ki}$'s is added to the proposed PTMCMC algorithm for the purpose of assessing the biological implications of the features. It is a simple Gibbs step, as the full conditional distributions of the $a_{ki}$'s are known Gaussian distributions. Table S.1 in the supplement lists these genes and their feature membership. We conduct the gene set enrichment analysis (GSEA) of Subramanian et al. (2005), comparing the genes in each feature with pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Kanehisa & Goto, 2000). The GSEA analysis reports the enriched pathways and the corresponding genes, which are listed in Table S.2 in the supplement. We observe the following findings.

Feature 1. Among all the features, only genes in feature 1 are enriched in the cell cycle pathway. They are also enriched in the p53 signaling pathway. This indicates that feature 1 might be related to cell death and the cell cycle.

Feature 2. Feature 2 is enriched in many different types of pathways, which may be driven by its two key gene members: NRAS:4893 (oncogene) and PTEN:5728 (tumor suppressor). These two genes play key roles in the PI3K-Akt signaling pathway and also regulate many other cancer-related pathways.

Feature 3. Genes in feature 3 are enriched in inflammation-related pathways, such as the non-alcoholic fatty liver disease, Hepatitis B, viral carcinogenesis, hematopoietic cell lineage and phagosome pathways. This suggests that feature 3 is mostly related to inflammation.

Features 4 and 5. Genes in features 4 and 5 are enriched in the largest number of pathways, which are largely shared between the two features, with the exception of the p53 pathway (enriched for feature 4 but not 5) and the Jak-STAT signaling pathway (enriched for feature 5 but not 4). This indicates that feature 4 is more related to intracellular factors like DNA damage, oxidative stress and activated oncogenes, while feature 5 is more related to extracellular factors such as cytokines and growth factors.

Depending on the possession of the first five features, the patients in each cancer type can be further divided into potential molecular subgroups. For example, most BRCA patients possess features 1, 2 or 3, indicating that these tumours are related to the cell death and cell cycle (features 1 and 2) or inflammation (feature 3) pathways; most of the GBM and KIRC patients possess features 1, 2, 3 or 4, indicating an additional subgroup of patients with tumours associated with DNA damage (feature 4). The DLBC patients are highly heterogeneous, as many of them do not possess any of the first five main features. This has been well recognized in the literature (Zhang et al., 2013). Lastly, LUAD seems to have two subgroups, possessing mostly feature 1 or 2, respectively. These features correspond to cell death and cell cycle functions, suggesting that these two subgroups of cancer could be related to abnormal cell death and cell cycle regulation.

We also note that there are other informative features besides the five mentioned above. For example, feature 16 is possessed only by BRCA patients, and its top genes include ESR1, AR, GATA3, AKT1, CASP7, ETS1, BCL2, FASN and CCNE2, all of which have been shown to be closely related to breast cancer (see, e.g., Chaudhary et al., 2016; Clatot et al., 2017; Cochrane et al., 2014; Dawson et al., 2010; Furlan et al., 2014; Ju et al., 2007; Menendez & Lupu, 2017; Takaku et al., 2015; Tormo et al., 2017).
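For completeness, the conjugate Gaussian full conditional behind the $\hat{a}_{ki}$'s admits a closed-form posterior mean; the sketch below is our own derivation for the model in Equation (1), consistent with the Gaussian linear-model results in Griffiths and Ghahramani (2011).

```python
import numpy as np

def loading_posterior_mean(X, Z, sigma2=1.0, sigma_a2=1.0):
    """Posterior mean E[A | Z, X] of the K x n loading matrix under Equation (1).

    Each column a_i has the conjugate Gaussian full conditional
    a_i | Z, x_i ~ N(M Z^T x_i^T, sigma2 * M), with M = (Z^T Z + (sigma2/sigma_a2) I)^{-1}."""
    K = Z.shape[1]
    M = np.linalg.inv(Z.T @ Z + (sigma2 / sigma_a2) * np.eye(K))
    return M @ Z.T @ X.T                  # K x n matrix of posterior means a_hat_ki
```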

6. Conclusion and discussion

Our main contributions in this paper are (1) reducing the requirement on the growth rate of the sample size with respect to dimensionality that ensures posterior convergence of the IBP mixture or pIBP mixture under a proper sparsity condition, (2) proposing an efficient MCMC scheme for sampling from the model, and (3) demonstrating the practical utility of the derived properties through an analysis of an RPPA dataset. The sparsity condition is mild and interpretable, making real-case applications possible. This result theoretically guarantees the validity of using the IBP mixture or pIBP mixture for posterior inference in high dimensional settings.

There are several directions along which we plan to investigate further. First, since the assumptions made on the true latent feature matrix $Z_n^*$ are quite mild, the posterior convergence in Theorem 3.4 only holds when $p_n=o(n)$. It is of interest whether posterior convergence still holds when $p_n$ increases faster, e.g., at a rate comparable to or exceeding $n$. As a trade-off, results with a faster-increasing $p_n$ would likely require additional assumptions on $Z_n^*$, such as Assumption 3.2 (A3) in Pati et al. (2014). It is also of interest to explore whether the contraction rate in Theorem 3.4 can be further improved with additional assumptions. This is closely related to the problem of minimax rate optimal estimators for $Z_n^* Z_n^{*T}$, or more broadly, the covariance matrix of random samples, which has been partially addressed in Pati et al. (2014). Second, in Equation (2), the two variance parameters $\sigma^2$ and $\sigma_a^2$ are assumed to take the value of 1. Generalization to the case where $\sigma^2$ and $\sigma_a^2$ are unknown can be made in a straightforward manner following Chen et al. (2016). Briefly, the idea is to assume independent priors for $\sigma^2$ and $\sigma_a^2$, put regularity conditions on the prior densities, and show that the new contraction rate is a function of the (constant) hyperparameters.

Another potential direction for further investigation is to extend the latent feature model (Equation (1)) to a more general latent factor model, in which the binary matrix $Z$ is replaced by a real-valued factor matrix $G$, and the binary matrix $Z$ is then used to indicate the sparsity of $G$; see, e.g., Knowles and Ghahramani (2011). To prove posterior convergence for such a model, Lemma 3.3 needs to be modified based on the factor loading matrix, as in, for example, Lemma 9.1 in Pati et al. (2014).

By definition, the pIBP assumes a single tree structure $T$ for the $p$ objects, where the tree $T$ is the same for all features, i.e., columns of $Z$ (Chen et al., 2016; Miller et al., 2008). When there is reason to believe that the tree structure varies across features, one may consider an extension of the pIBP model that allows different tree structures across features. For example, if some latent features are anticipated to have different interpretations from the others, the dependence structure of the $p$ objects may also change across features, resulting in multiple trees. However, such a model is expected to pose additional theoretical and computational challenges.

Throughout this paper we measure the difference between similarity matrices by the spectral norm. Other matrix norms, such as the Frobenius norm, may be explored. Our current results focus on the posterior convergence of $Z_n Z_n^T$ rather than $Z_n$ itself, due to the identifiability issue of $Z_n$ arising from Equation (2). A future direction is to investigate to what extent $Z_n$ can be estimated, for which a Hamming-distance-like measure between feature matrices can be considered.

Finally, we are working on general hierarchical models that embed sparsity into the model construction.

Supplemental material


Disclosure statement

No potential conflict of interest was reported by the author(s).

References