Batch Effects Correction with Unknown Subtypes

ABSTRACT

High-throughput experimental data are accumulating exponentially in public databases. Unfortunately, however, mining valid scientific discoveries from these abundant resources is hampered by technical artifacts and inherent biological heterogeneity. The former are usually termed “batch effects,” and the latter is often modeled by subtypes. Existing methods either tackle batch effects provided that subtypes are known or cluster subtypes assuming that batch effects are absent. Consequently, there is a lack of research on the correction of batch effects in the presence of unknown subtypes. Here, we combine a location-and-scale adjustment model and model-based clustering into a novel hybrid one, the batch-effects-correction-with-unknown-subtypes model (BUS). BUS is capable of (a) correcting batch effects explicitly, (b) grouping samples that share similar characteristics into subtypes, (c) identifying features that distinguish subtypes, (d) allowing the number of subtypes to vary from batch to batch, (e) integrating batches from different platforms, and (f) enjoying linear-order computational complexity. We prove the identifiability of BUS and provide conditions for study designs under which batch effects can be corrected. BUS is evaluated by simulation studies and a real breast cancer dataset combined from three batches measured on two platforms. Results from the breast cancer dataset offer much better biological insights than existing methods. We implement BUS as a free Bioconductor package BUScorrect. Supplementary materials for this article are available online.

1. Introduction

To date, more than 1.7 million samples have been deposited into the Gene Expression Omnibus (Edgar et al. 2002), which provides unprecedented opportunities to decipher the gene regulation machinery (Pickrell et al. 2010), understand disease mechanisms (Chahrour et al. 2008), and develop personalized treatments (Suárez-Fariñas et al. 2010). Nevertheless, digging in this gold mine is a daunting task. On the one hand, raw data from high-throughput experiments suffer from various types of technical artifacts. On the other hand, biological samples are inherently heterogeneous. Teasing out meaningful biological variation from technical artifacts is the major complication in fully using the abundant data sources available for downstream analysis.

Researchers have long been aware that samples generated on different days are not directly comparable. Samples processed at the same time are usually referred to as coming from the same “batch.” Even when the same biological conditions are measured, data from different batches can present very different patterns. The variation among different batches may be due to changes in laboratory conditions, preparation time, reagent lots, and experimenters (Leek et al. 2010). The effects caused by these systematic factors are called “batch effects.” Batch effects are prevalent among all types of high-throughput technologies, ranging from microarray (Irizarry et al. 2005) and next-generation sequencing (Taub et al. 2010) to single-cell sequencing experiments (Hicks et al. 2015).

Various “batch effects” correction methods have been proposed when the subtype information for each sample is known. Here, we adopt a “broad” definition for “subtype.” “Subtype” is defined as a set of samples that share the same underlying genomic profile, in other words biological variability, when measured with no technical artifacts. For instance, groupings such as “case” and “control” can be viewed as two subtypes. Given subtypes that are known, Johnson et al. (2007) adjusted batch effects for microarray data in a model-based location-and-scale (L/S) scheme via an empirical Bayesian approach called ComBat. Their model also allows the incorporation of known confounding factors such as age and gender. However, in many experiments, researchers have no measurements of these confounding factors. To adjust for “unmodeled factors,” Leek and Storey (2007) introduced surrogate variable analysis (SVA) for microarray data. Although unmodeled factors cannot be reconstructed exactly, the surrogate variables, whose expanded linear space is the same as that of unmodeled factors, can be learned directly from the data. Accordingly, introducing surrogate variables into the model sidesteps spurious variation due to unmodeled factors and provides accurate estimation of the association between disease status and gene expression. Later, Leek (2014) extended SVA to deal with batch effects in next-generation sequencing datasets.

Unfortunately, both ComBat and SVA require the subtype information for each sample and are infeasible when the subtypes are unknown. The frozen robust multi-array analysis (fRMA; McCall et al. 2010) was developed to normalize microarray samples individually so that samples from different batches are comparable after normalization. fRMA relies on training on a large database of samples from the same microarray platform. As a result, fRMA is only available for a very limited number of array platforms and cannot handle batches measured on different platforms. The single-channel array normalization (SCAN) proposed by Piccolo et al. (2012) can normalize microarray samples measured by different platforms. SCAN fits a two-component normal mixture to the probes of a single sample, one component for background noise and the other for signals. However, SCAN only looks at one sample at a time and thus fails to identify systematic biases associated with the batch. To borrow information and estimate the correlation of measurements from different platforms, Franks et al. (2015) proposed a hierarchical model and applied it to quantify the coordination between transcription and translation levels in yeast. Nevertheless, the model assumes a single biological condition without subtype heterogeneity. Therefore, to the best of our knowledge, the general problem of batch effects correction in the presence of unknown subtypes remains open.

Meanwhile, assuming the absence of batch effects, there is extensive research on subtype discovery. Pan and Shen (2007) proposed a model-based clustering algorithm using penalized Expectation Maximization (EM). The approach appends a Lasso penalty (Tibshirani 1996) to the Q function in the EM algorithm (Dempster et al. 1977). The missing data formulation in the EM algorithm clusters samples into subtypes. The penalty term helps to distinguish the genes whose expression levels remain the same across all subtypes from those genes with varying means among different subtypes. This latter set of genes characterizing distinct subtypes is called the intrinsic gene set (Huo et al. 2016). Wang and Zhu (2008) improved the penalized approach by replacing the Lasso penalty with an L_∞-norm penalty or a hierarchical penalty. Consequently, parameters belonging to one gene are treated as a single group and penalized together, thus reducing the number of falsely included noninformative genes. Intrinsic gene selection for subtype discovery is also discussed in sparse K-means (Witten and Tibshirani 2012). Note that the traditional K-means can be viewed as maximizing the Between Cluster Sum of Squares (BCSS), which can be further partitioned into the summation of gene-specific BCSS. Sparse K-means gives nonnegative weights to each gene-specific BCSS term and then optimizes the weighted BCSS under L₁ and L₂ constraints on the weights. Finally, genes with nonzero weights remain in the intrinsic gene set, and clustering is achieved simultaneously. However, none of the aforementioned methods consider batch effects. As a result, direct application of the above methods to data combined from several batches will lead to problematic scientific conclusions. Conversely, clustering subtypes within each batch separately suffers from small sample size and fails to use shared information among the different batches. Therefore, methods for removing batch effects from datasets with unknown subtypes are important and in urgent demand.

Recently, building upon sparse K-means, Huo et al. (2016) proposed MetaSparseKmeans to discover subtypes using samples from multiple batches. MetaSparseKmeans designs a pattern-matching reward function to encourage the matching of the same subtype across batches. Nevertheless, MetaSparseKmeans requires each batch to contain all subtypes, which is not feasible for many experimental designs. Moreover, the computational complexity of calculating the pattern-matching reward function is O((K!)^{B − 1}) for K subtypes and B batches. As admitted by the authors, even when K = 5 and B = 5, an accurate exhaustive search is prohibitive for 207.36 million comparisons. Furthermore, most importantly, MetaSparseKmeans does not explicitly characterize and correct for batch effects. Consequently, it is limited to the clustering problem and cannot benefit other analyses such as differential gene expression detection and gene regulatory network construction.

In this article, we integrate the L/S model with model-based clustering and propose the Batch-effects-correction-with-Unknown-Subtypes (BUS) approach. BUS simultaneously corrects batch effects, discovers subtypes, and selects features that discriminate different subtypes. After correcting the batch effects with BUS, the corrected values can be used for other analyses as if all samples were measured in a single batch. BUS can integrate batches measured on different platforms and allows subtypes to be measured in some but not all of the batches. We prove the identifiability of BUS. Based on the theoretical results, we provide conditions for experimental designs under which batch effects can be corrected.

We conduct statistical inference under the Bayesian framework and develop a Gibbs sampler to draw samples from the posterior distribution. BUS is computationally efficient: its complexity grows linearly in the number of subtypes K and the number of batches B. Our simulation studies demonstrate that BUS accurately estimates batch effects and subtype effects, clusters subtypes, and selects intrinsic genes. Finally, BUS is applied to three batches of breast cancer microarray data with five subtypes measured on two different platforms (Huo et al. 2016). Although we focus on gene expression microarray data here, the same framework can be adapted to DNA methylation microarrays, RNA-seq data, and single-cell sequencing data.

2. Model Formulation

2.1. Location-and-Scale Adjustments

In this subsection, we review the classic location-and-scale (L/S) adjustment model for batch effects correction when the subtypes are known (Johnson et al. 2007). The L/S model characterizes two types of batch effects: the additive effects influencing the means and the multiplicative effects affecting the variances of gene expression. Specifically, let Y_bjg denote the expression level of gene g for sample j in batch b; then the L/S model is formulated as follows: $Y_{b j g} = α_{g} + X_{b j} μ_{g} + γ_{b g} + δ_{b g} ε_{b j g} .$ (2.1) Here, X_bj encodes the study design, that is, which subtype each sample comes from. Suppose there are K subtypes in the dataset; then X_bj is a K-dimensional binary vector whose kth element is one and whose other elements are zero if sample j in batch b belongs to subtype k. Consequently, $μ_{g} = (μ_{g 1}, \dots, μ_{g K})$ records the subtype effects for gene g. To make the model identifiable, in the following analysis, we set μ_g1 = 0. Subsequently, α_g represents the mean expression level of gene g in subtype one.

In terms of batch effects, the “location” term γ_bg captures the shift in the mean due to additive effects, and the “scale” term δ_bg quantifies the variability of the gene expression caused by the multiplicative batch effects. For model identifiability, without loss of generality, we always refer to batch one as the “reference batch,” where the location batch effects and the scale batch effects are assumed to be absent. In other words, γ_1g = 0 and δ_1g = 1 for all 1 ⩽ g ⩽ G. Finally, the noise term ε_bjg is assumed to follow a normal distribution N(0, σ²_0g) with mean zero and variance σ²_0g.

For the convenience of future discussion, we absorb δ_bgε_bjg into ϵ_bjg and arrive at the equivalent model: $Y_{b j g} = α_{g} + X_{b j} μ_{g} + γ_{b g} + ϵ_{b j g},$ (2.2) where ϵ_bjg follows N(0, σ²_bg) with σ²_bg = δ²_bgσ²_0g. Because δ_1g = 1, the ratio σ²_bg/σ²_1g equals δ²_bg, the squared multiplicative batch effect.
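As a rough illustration of Model (2.2), the sketch below draws expression values for samples whose subtypes are known. The function name `simulate_ls`, the nested-list layout, and all numeric values are assumptions made for this example and are not taken from the BUScorrect package.

```python
import random

def simulate_ls(alpha, mu, gamma, sigma, subtypes, seed=0):
    """Draw Y[b][j][g] from Model (2.2) with known subtype labels.

    alpha[g]    : baseline expression of gene g (subtype one, batch one)
    mu[g][k]    : subtype effect, with mu[g][0] == 0 for identifiability
    gamma[b][g] : location batch effect, with gamma[0][g] == 0
    sigma[b][g] : batch-specific noise standard deviation
    subtypes[b] : 0-based subtype label of each sample in batch b
    """
    rng = random.Random(seed)
    num_genes = len(alpha)
    data = []
    for b, labels in enumerate(subtypes):
        batch = []
        for k in labels:
            batch.append([
                alpha[g] + mu[g][k] + gamma[b][g] + rng.gauss(0.0, sigma[b][g])
                for g in range(num_genes)
            ])
        data.append(batch)
    return data

# Two genes, two subtypes, two batches (hypothetical values).
alpha = [2.0, 2.0]
mu = [[0.0, 2.0], [0.0, 0.0]]        # gene 0 differs between the subtypes
gamma = [[0.0, 0.0], [1.0, -1.0]]    # batch one is the reference batch
sigma = [[0.1, 0.1], [0.2, 0.2]]
subtypes = [[0, 0, 1], [1, 1, 0, 0]]
y = simulate_ls(alpha, mu, gamma, sigma, subtypes)

# With the noise switched off, each value is exactly alpha + mu + gamma.
y_noise_free = simulate_ls(alpha, mu, gamma, [[0.0, 0.0], [0.0, 0.0]], subtypes)
```

In the noise-free draw, a subtype-two sample in batch two has mean 2 + 2 + 1 = 5 for the first gene, which shows how the batch shift is confounded with the subtype effect when labels are not available.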

2.2. The BUS Model

Nevertheless, the L/S model assumes that the subtype information is known to the investigator. To handle the case in which subtype information is unknown, we propose the following Batch-effects-correction-with-Unknown-Subtypes (BUS) model built upon the L/S model. The main challenge is that the subtype indicators X_bj in Equation (2.2) now become missing data, and hence, we need to infer the subtype information for each sample as well. Here, we employ ideas from the literature on model-based clustering (Banfield and Raftery 1993; Yeung et al. 2001; Fraley and Raftery 2002; McLachlan and Peel 2004).

Let us first assume that all of the samples come from K subtypes and are measured in a single batch. Therefore, we temporarily drop the subscript b for batch indicators and collect all of the gene expression values for sample j into Y_j = (Y_j1, …, Y_jG). If sample j belongs to subtype k, then we assume Y_j follows a multivariate Gaussian distribution N(m_k, Σ) with a subtype mean m_k = (m_1k, …, m_Gk) and a common diagonal covariance matrix $Σ = diag (σ_{1}^{2}, σ_{2}^{2}, \dots, σ_{G}^{2})$ across the subtypes (Fraley and Raftery 2002; McLachlan and Peel 2004; Bickel and Levina 2004). The resulting model is a Gaussian mixture: $Y_{j} \sim π_{1} N (m_{1}, Σ) + π_{2} N (m_{2}, Σ) + \dots + π_{K} N (m_{K}, Σ), \quad iid, \; j = 1, \dots, n,$ (2.3) where “iid” stands for “independent and identically distributed,” and π_k indicates the proportion of subtype k in the samples, satisfying π_k ⩾ 0, ∑^K_{k = 1}π_k = 1. Bringing in a subtype indicator Z_j for each sample j, in which Z_j = k if sample j comes from subtype k, an equivalent model can be formulated as follows: for 1 ⩽ j ⩽ n and 1 ⩽ g ⩽ G, $\begin{matrix} Z_{j} \sim Multinomial (1; π_{1}, \dots, π_{K}); \\ Y_{j g} \sim N (m_{g k}, σ_{g}^{2}) | Z_{j} = k . \end{matrix}$ (2.4) To align with Model (2.2), Z_j = k can be mapped to X_j with its kth element equal to one and all of the others being zero, and vice versa.

Going one step further, we now consider the multi-batch case in which heterogeneous samples from multiple subtypes are measured in multiple batches. We assume that there are in total B batches and K subtypes, and batch b contains n_b samples. For each sample, gene expression levels in all G genes are measured. As in Section 2.1, we denote the gene expression level of gene g for sample j in batch b by Y_bjg. We allow the proportion of subtypes to vary from one batch to another. Therefore, we use π_bk to denote the proportion of subtype k in batch b; thus, ∑^K_{k = 1}π_bk = 1. Consequently, the integration of Model (2.2) and Model (2.3) gives rise to: for 1 ⩽ b ⩽ B, 1 ⩽ j ⩽ n_b, 1 ⩽ g ⩽ G, $\begin{matrix} Z_{b j} \sim Multinomial (1; π_{b 1}, \dots, π_{b K}); \\ Y_{b j g} \sim N (m_{b g k}, σ_{b g}^{2}) | Z_{b j} = k; \\ m_{b g k} = α_{g} + X_{b j} μ_{g} + γ_{b g} = α_{g} + μ_{g k} + γ_{b g} | Z_{b j} = k . \end{matrix}$ (2.5) Accordingly, the complete likelihood function for the observed data $Y = \{Y_{b j g}\}_{b = 1, \dots, B; j = 1, \dots, n_{b}; g = 1, \dots, G}$ and missing data $Z = \{Z_{b j}\}_{b = 1, \dots, B; j = 1, \dots, n_{b}}$ becomes: $L_{c} (Θ | Y, Z) = \prod_{b = 1}^{B} \prod_{j = 1}^{n_{b}} \prod_{k = 1}^{K} \left[π_{b k} \prod_{g = 1}^{G} \frac{1}{\sqrt{2 π} σ_{b g}} \exp \left\{- \frac{(y_{b j g} - α_{g} - μ_{g k} - γ_{b g})^{2}}{2 σ_{b g}^{2}}\right\}\right]^{I (Z_{b j} = k)},$ where Θ is composed of all unknown parameters {π_bk: 1 ⩽ b ⩽ B, 1 ⩽ k ⩽ K}, {α_g: 1 ⩽ g ⩽ G}, {μ_gk: 1 ⩽ g ⩽ G, 2 ⩽ k ⩽ K}, {γ_bg: 2 ⩽ b ⩽ B, 1 ⩽ g ⩽ G}, and {σ²_bg: 1 ⩽ b ⩽ B, 1 ⩽ g ⩽ G}. Recall that μ_g1 and γ_1g (1 ⩽ g ⩽ G) are constrained to zero for identifiability.
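The complete likelihood is easier to read on the log scale, where the indicator exponent simply selects the component that each sample is assigned to. The following is a minimal sketch under that reading; the function name `complete_log_likelihood` and the 0-based nested-list layout are assumptions made for illustration.

```python
import math

def complete_log_likelihood(y, z, pi, alpha, mu, gamma, sigma2):
    """Log of L_c(Theta | Y, Z) for the BUS model (2.5).

    y[b][j][g]   : observed expression values
    z[b][j]      : 0-based subtype label of sample j in batch b
    pi[b][k]     : subtype proportions within batch b
    alpha[g], mu[g][k], gamma[b][g] : mean components
    sigma2[b][g] : batch- and gene-specific variances
    """
    total = 0.0
    for b, batch in enumerate(y):
        for j, row in enumerate(batch):
            k = z[b][j]
            # Only the assigned subtype contributes, because I(Z_bj = k)
            # is zero for every other k.
            total += math.log(pi[b][k])
            for g, value in enumerate(row):
                mean = alpha[g] + mu[g][k] + gamma[b][g]
                var = sigma2[b][g]
                total += -0.5 * math.log(2.0 * math.pi * var)
                total -= (value - mean) ** 2 / (2.0 * var)
    return total

# One batch, one sample, one gene, observed exactly at its mean.
ll = complete_log_likelihood(
    y=[[[2.0]]], z=[[0]], pi=[[1.0]],
    alpha=[2.0], mu=[[0.0]], gamma=[[0.0]], sigma2=[[1.0]],
)
```

For this toy input the result reduces to the log density of a standard normal at zero, that is, −½·log(2π).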

As a side note, BUS can also handle known confounding factors such as age, gender, and BMI by modifying m_bgk to $α_{g} + μ_{g k} + U_{b j} β_{g} + γ_{b g}$ , where U_bj corresponds to the confounding variables.

3. Identifiability

In this section, we first prove that BUS is identifiable when every subtype is present in every batch. Next, we extend the results to more general study designs and provide guidelines on the experimental design so that batch effects can be corrected.

3.1. With Complete Subtypes

We first investigate the scenario as proposed by MetaSparseKmeans (Huo et al. 2016) in which all subtypes are required to occur in all the batches. We refer to this setting as the “complete subtypes” case. In the BUS model, the likelihood function for the observed data is $L_{o} (Θ | Y) = \prod_{b = 1}^{B} \prod_{j = 1}^{n_{b}} \left[\sum_{k = 1}^{K} π_{b k} \prod_{g = 1}^{G} \frac{1}{\sqrt{2 π} σ_{b g}} \exp \left\{- \frac{(y_{b j g} - α_{g} - μ_{g k} - γ_{b g})^{2}}{2 σ_{b g}^{2}}\right\}\right],$ where Θ consists of all unknown parameters $Θ = \{π, α, μ_{2}, \dots, μ_{K}, γ_{2}, \dots, γ_{B}, σ_{1}, \dots, σ_{B}\}$ with $π = \{π_{b k} \geq 0 : 1 \leq b \leq B, 1 \leq k \leq K\}$ , $α = \{α_{g} : 1 \leq g \leq G\}$ , $μ_{k} = \{μ_{g k} : 1 \leq g \leq G\}$ , $γ_{b} = \{γ_{b g} : 1 \leq g \leq G\}$ , and $σ_{b} = \{σ_{b g} : 1 \leq g \leq G\}$ . Recall that $μ_{1} = 0$ and $γ_{1} = 0$ .

Theorem 1.

If $μ_{k_{1}} - μ_{k_{2}} \neq μ_{k_{3}} - μ_{k_{4}}$ for any (k₁, k₂) ≠ (k₃, k₄) (Assumption I) and π_bk > 0 for every b and k, then BUS is identifiable (up to label switching) in the sense that L_o(Θ|Y) = L_o(Θ*|Y) for any Y implies that π_bk = π*_bρ(k), $α + μ_{k} = α^{*} + μ_{ρ (k)}^{*}$ , $γ_{b} = γ_{b}^{*}$ , and $σ_{b} = σ_{b}^{*}$ , where ρ is a permutation of {1, 2, …, K}.

The condition on π_bk > 0 corresponds to the “complete subtypes” experimental design. Assumption I is a very mild one. It actually only asks for the existence of one gene g whose mean expression differences between subtypes satisfy $μ_{g k_{1}} - μ_{g k_{2}} \neq μ_{g k_{3}} - μ_{g k_{4}}$ for any (k₁, k₂) ≠ (k₃, k₄), which is always easily met for high-throughput biology data.

3.2. With Missing Subtypes

Next, we investigate the scenario in which subtypes are measured in some but not all of the batches, which is the general study design usually encountered in real life. We call this setting “with missing subtypes.” In other words, we allow π_bk = 0 for some b and k. In the following, we use C_b to denote the set of subtypes that are present in batch b and D_b to denote all of the data measured on batch b. We provide two types of experimental design that guarantee the identifiability of BUS.

Theorem 2.

Suppose that (A) $⋃_{b = 1}^{B} C_{b} = \{1, 2, \dots, K\}$ ; (B) K_b ⩾ 2 for every batch b, where K_b = |C_b| is the cardinality of C_b; (C) C₁ = {1, 2, …, K}; and Assumption I is satisfied. Then BUS is identifiable (up to label switching).

Condition (C) assumes that there exists a reference batch, without loss of generality taken as batch one, that contains samples from all subtypes. In practice, this assumption often holds if we have a large database built upon previous experiments. In a new experiment consisting of batches two to B (B ⩾ 2), if for each batch b (2 ⩽ b ⩽ B) we collect samples that are from at least two subtypes present in batch one, then Theorem 2 holds. Therefore, we can apply BUS to the dataset {D_b: b = 1, …, B} to separate the confounding batch effects from the true biological variations in the new experiment.

Theorem 3.

Suppose that (A) $⋃_{b = 1}^{B} C_{b} = \{1, 2, \dots, K\}$ ; (B) K_b ⩾ 2 for every batch b, where K_b = |C_b| is the cardinality of C_b; (D) $| C_{b} \cap C_{b - 1} | \geq 2$ for every b ⩾ 2; and Assumption I is satisfied. Then BUS is identifiable (up to label switching).

Theorem 3 tells us how to design a valid study when we do not have a large database that contains every subtype. In detail, we can begin with batch one whose samples are known to be taken from at least two subtypes. When preparing samples for batch two, we recommend collecting samples that are from at least two known subtypes in batch one. Based on Theorem 3, we can then apply BUS to {D_b: b = 1, 2} to learn the subtype for each sample in batch two. In the same spirit, we require the samples for batch three to be from at least two learned subtypes in batch two, and we then apply BUS to {D_b: b = 1, 2, 3} and so on and so forth. This chain-type experimental design satisfies the conditions of Theorem 3, so batch effects are estimable.

In a nutshell, the conditions of both Theorems 2 and 3 can be implemented when planning real multi-batch experiments. Therefore, from the perspective of practitioners, these conditions help to guide experimental designs that can simultaneously filter out batch effects and keep true biological variations.
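The design conditions of Theorems 2 and 3 are purely combinatorial, so a planned study can be screened before any data are collected. The sketch below assumes that each batch is described by the set C_b of subtypes it is expected to contain; the function name `check_design` and the two example designs are hypothetical. Assumption I concerns the unknown subtype effects and is not checked here.

```python
def check_design(present, num_subtypes):
    """Check conditions (A)-(D) for a list of subtype sets C_1, ..., C_B.

    present[b] is the set of 1-based subtype labels expected in batch b + 1,
    with present[0] playing the role of batch one.
    """
    all_subtypes = set(range(1, num_subtypes + 1))
    # (A) every subtype appears in at least one batch
    cond_a = set().union(*present) == all_subtypes
    # (B) every batch contains at least two subtypes
    cond_b = all(len(c) >= 2 for c in present)
    # (C) batch one contains all subtypes
    cond_c = present[0] == all_subtypes
    # (D) consecutive batches share at least two subtypes
    cond_d = all(
        len(present[b] & present[b - 1]) >= 2 for b in range(1, len(present))
    )
    return {
        "theorem2": cond_a and cond_b and cond_c,
        "theorem3": cond_a and cond_b and cond_d,
    }

# Reference-batch design: batch one holds every subtype.
reference_design = check_design([{1, 2, 3}, {1, 2}, {2, 3}], num_subtypes=3)

# Chain-type design: each batch overlaps the previous one in two subtypes.
chain_design = check_design([{1, 2}, {1, 2, 3}, {2, 3, 4}], num_subtypes=4)
```

The first design satisfies Theorem 2 but not Theorem 3, because batches two and three share only subtype 2. The second satisfies Theorem 3 but not Theorem 2, because no single batch contains all four subtypes.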

4. Statistical Inference

4.1. Prior Specification

We adopt a full Bayesian approach to conduct statistical inference (Hein et al. 2005; Gelman et al. 2014). We assign independent conjugate priors to each component of Θ as follows: $π_{b} = (π_{b 1}, \dots, π_{b K}) \sim Dir (α, \dots, α), 1 \leq b \leq B$ ; α_g ∼ N(m, τ²_m), 1 ⩽ g ⩽ G; γ_bg ∼ N(0, τ²_γ), 2 ⩽ b ⩽ B, 1 ⩽ g ⩽ G; $σ_{b g}^{2} \sim Inv Γ (\tilde{a}, \tilde{b}), 1 \leq b \leq B, 1 \leq g \leq G$ with hyper-parameters $(α, m, τ_{m}^{2}, τ_{γ}^{2}, \tilde{a}, \tilde{b})$ . Here the unsubscripted α is the scalar Dirichlet concentration parameter, not to be confused with the gene-specific baseline α_g.

Regarding the subtype effect μ_gk, we assign a spike-and-slab prior (George and McCulloch 1993) using a normal mixture. Specifically, one component of the mixture concentrates near zero with a small variance, and the other is more dispersed with a larger variance. To represent the mixture distribution, we introduce the latent variable L_gk to indicate which component of the mixture distribution μ_gk comes from. When L_gk = 0, gene g is believed to have the same expression level in subtype k and subtype one. Therefore, μ_gk is assumed to be close to zero, following the normal component with a small variance. When L_gk = 1, gene g is differentially expressed (DE) between subtype k and subtype one. Consequently, μ_gk tends to deviate substantially from zero, following the normal component with a large variance. As a result, the expression of gene g does not hold constant across subtypes if and only if D_g ≐ ∑^K_{k = 2}L_gk > 0. We call such genes intrinsic genes following Huo et al. (2016) and name the D_g’s intrinsic gene indicators. Scientifically, the L_gk’s identify the set of genes that define and differentiate subtypes. Denoting the proportion of L_gk’s being one by p, the relationship between μ_gk and L_gk for g = 1, …, G; k = 2, …, K is as follows: $\begin{matrix} L_{g k} \sim Bernoulli (p); \\ μ_{g k} \sim N (0, τ_{μ 1}^{2}) | L_{g k} = 1; \\ μ_{g k} \sim N (0, τ_{μ 0}^{2}) | L_{g k} = 0 . \end{matrix}$ τ²_μ1 is set to a large number, and τ²_μ0 follows an inverse-gamma prior InvΓ(a_τ, b_τ) with a small prior mean. Meanwhile, p has the conjugate prior $Beta (a_{p}, b_{p})$ .

With all the priors, the full posterior distribution f(Θ, Z, L|Y) is proportional to $\begin{matrix} \prod_{b = 1}^{B} \prod_{j = 1}^{n_{b}} \prod_{k = 1}^{K} \left[π_{b k} \prod_{g = 1}^{G} \frac{1}{\sqrt{2 π} σ_{b g}} \exp \left\{- \frac{(y_{b j g} - α_{g} - μ_{g k} - γ_{b g})^{2}}{2 σ_{b g}^{2}}\right\}\right]^{I (Z_{b j} = k)} \\ \cdot \prod_{b = 1}^{B} Dir (π_{b}; α) \prod_{g = 1}^{G} N (α_{g}; m, τ_{m}^{2}) \\ \cdot \prod_{b = 2}^{B} \prod_{g = 1}^{G} N (γ_{b g}; 0, τ_{γ}^{2}) \prod_{b = 1}^{B} \prod_{g = 1}^{G} Inv Γ (σ_{b g}^{2}; \tilde{a}, \tilde{b}) \\ \cdot \prod_{g = 1}^{G} \prod_{k = 2}^{K} [N (μ_{g k}; 0, τ_{μ 1}^{2}) \cdot L_{g k} + N (μ_{g k}; 0, τ_{μ 0}^{2}) \cdot (1 - L_{g k})] \\ \cdot \prod_{g = 1}^{G} \prod_{k = 2}^{K} p^{L_{g k}} (1 - p)^{1 - L_{g k}} \cdot Beta (p; a_{p}, b_{p}) \cdot Inv Γ (τ_{μ 0}^{2}; a_{τ}, b_{τ}) . \end{matrix}$

4.2. Posterior Inference

To explore the posterior distribution, we develop a Gibbs sampler to draw samples (Geman and Geman 1984; Robert and Casella 2013). At iteration t:

1.	Update the inclusion probability p^[t] for L_gks from $Beta (\sum_{g = 1}^{G} \sum_{k = 2}^{K} L_{g k}^{[t - 1]} + a_{p}, G (K - 1) - \sum_{g = 1}^{G} \sum_{k = 2}^{K} L_{g k}^{[t - 1]} + b_{p}) .$
2.	Sample the variance of the spike component of the spike-and-slab prior (τ^[t]_μ0)² from $\begin{matrix} Inv Γ (a_{τ} + \frac{1}{2} \# \{(g, k) : L_{g k}^{[t - 1]} = 0, 2 \leq k \leq K\}, \\ b_{τ} + \frac{1}{2} \sum_{\{(g, k) : L_{g k}^{[t - 1]} = 0\}} (μ_{g k}^{[t - 1]})^{2}) . \end{matrix}$
3.	For each gene g and for 2 ⩽ k ⩽ K, update indicator L^[t]_gk from $\begin{matrix} Bernoulli \\ (\frac{p^{[t]} \cdot N (μ_{g k}^{[t - 1]}; 0, τ_{μ 1}^{2})}{p^{[t]} \cdot N (μ_{g k}^{[t - 1]}; 0, τ_{μ 1}^{2}) + (1 - p^{[t]}) \cdot N (μ_{g k}^{[t - 1]}; 0, (τ_{μ 0}^{[t]})^{2})}) . \end{matrix}$
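4.	For each batch b, sample subtype proportions $π_{b}^{[t]}$ from the conjugate Dirichlet distribution $Dir (α + \# \{j : Z_{b j}^{[t - 1]} = 1\}, \dots, α + \# \{j : Z_{b j}^{[t - 1]} = K\}) .$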
5.	For each sample j in every batch b, update its subtype indicator according to $\begin{matrix} p (Z_{b j}^{[t]} = k | -) \\ \propto π_{b k}^{[t]} \exp \{- \sum_{g = 1}^{G} \frac{(y_{b j g} - α_{g}^{[t - 1]} - μ_{g k}^{[t - 1]} - γ_{b g}^{[t - 1]})^{2}}{2 (σ_{b g}^{[t - 1]})^{2}}\}, \end{matrix}$ where “-” denotes all of the other variables.
6.	For each gene g, sample its baseline expression level α^[t]_g from $\begin{matrix} N (\frac{τ_{m}^{2} \sum_{b = 1}^{B} \sum_{j = 1}^{n_{b}} [(y_{b j g} - μ_{g Z_{b j}^{[t]}}^{[t - 1]} - γ_{b g}^{[t - 1]}) \frac{1}{(σ_{b g}^{[t - 1]})^{2}}] + m}{τ_{m}^{2} \sum_{b = 1}^{B} \frac{n_{b}}{(σ_{b g}^{[t - 1]})^{2}} + 1}, \\ \frac{τ_{m}^{2}}{τ_{m}^{2} \sum_{b = 1}^{B} \frac{n_{b}}{(σ_{b g}^{[t - 1]})^{2}} + 1}) . \end{matrix}$
7.	For each gene g in subtype two to K, sample its subtype effect μ^[t]_gk from $\begin{matrix} if L_{g k}^{[t]} = 1, \\ N (\frac{τ_{μ 1}^{2} \sum_{b = 1}^{B} \sum_{j \in {1 \leq j \leq n_{b} : Z_{b j}^{[t]} = k}} (y_{b j g} - α_{g}^{[t]} - γ_{b g}^{[t - 1]}) \frac{1}{(σ_{b g}^{[t - 1]})^{2}}}{τ_{μ 1}^{2} \sum_{b = 1}^{B} # {j : Z_{b j}^{[t]} = k} \cdot \frac{1}{(σ_{b g}^{[t - 1]})^{2}} + 1}, \\ \frac{τ_{μ 1}^{2}}{τ_{μ 1}^{2} \sum_{b = 1}^{B} # {j : Z_{b j}^{[t]} = k} \cdot \frac{1}{(σ_{b g}^{[t - 1]})^{2}} + 1}); \\ if L_{g k}^{[t]} = 0, \\ N (\frac{τ_{μ 0}^{[t] 2} \sum_{b = 1}^{B} \sum_{j \in {1 \leq j \leq n_{b} : Z_{b j}^{[t]} = k}} (y_{b j g} - α_{g}^{[t]} - γ_{b g}^{[t - 1]}) \frac{1}{(σ_{b g}^{[t - 1]})^{2}}}{τ_{μ 0}^{[t] 2} \sum_{b = 1}^{B} # {j : Z_{b j}^{[t]} = k} \cdot \frac{1}{(σ_{b g}^{[t - 1]})^{2}} + 1}, \\ \frac{τ_{μ 0}^{[t] 2}}{τ_{μ 0}^{[t] 2} \sum_{b = 1}^{B} # {j : Z_{b j}^{[t]} = k} \cdot \frac{1}{(σ_{b g}^{[t - 1]})^{2}} + 1}) . \end{matrix}$
8.	For each gene in batch two to B, sample the additive “location” batch effects γ^[t]_bg from $N (\frac{τ_{γ}^{2} \sum_{j = 1}^{n_{b}} (y_{b j g} - α_{g}^{[t]} - μ_{g Z_{b j}^{[t]}}^{[t]}) \frac{1}{(σ_{b g}^{[t - 1]})^{2}}}{τ_{γ}^{2} \frac{n_{b}}{(σ_{b g}^{[t - 1]})^{2}} + 1}, \frac{τ_{γ}^{2}}{τ_{γ}^{2} \frac{n_{b}}{(σ_{b g}^{[t - 1]})^{2}} + 1}) .$
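9.	For each gene in each batch, sample the variance (σ^[t]_bg)², which carries the multiplicative “scale” batch effect, from the conjugate inverse-Gamma distribution $Inv Γ (\tilde{a} + \frac{n_{b}}{2}, \tilde{b} + \frac{1}{2} \sum_{j = 1}^{n_{b}} (y_{b j g} - α_{g}^{[t]} - μ_{g Z_{b j}^{[t]}}^{[t]} - γ_{b g}^{[t]})^{2}) .$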
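Two of the updates above are discrete and easy to get numerically wrong, so they are sketched here: the inclusion probability in step 3 and the subtype probabilities in step 5. This is a minimal illustration and not the BUScorrect implementation; the function names and the toy inputs are assumptions. Step 5 is evaluated on the log scale because the sum over G genes in the exponent would otherwise underflow for realistic G.

```python
import math
import random

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def inclusion_probability(mu_gk, p, tau2_slab, tau2_spike):
    """Step 3: probability that L_gk = 1 given the current mu_gk."""
    slab = p * normal_pdf(mu_gk, tau2_slab)
    spike = (1.0 - p) * normal_pdf(mu_gk, tau2_spike)
    return slab / (slab + spike)

def subtype_probabilities(y_row, pi_b, alpha, mu, gamma_b, sigma2_b):
    """Step 5: normalized p(Z_bj = k | -) for one sample."""
    log_w = []
    for k, weight in enumerate(pi_b):
        if weight == 0.0:
            log_w.append(float("-inf"))  # subtype absent from this batch
            continue
        s = math.log(weight)
        for g, value in enumerate(y_row):
            resid = value - alpha[g] - mu[g][k] - gamma_b[g]
            s -= resid * resid / (2.0 * sigma2_b[g])
        log_w.append(s)
    top = max(log_w)  # subtract the maximum before exponentiating
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

# Toy sample with two genes and two subtypes (hypothetical values).
probs = subtype_probabilities(
    y_row=[4.1, 1.9], pi_b=[0.5, 0.5], alpha=[2.0, 2.0],
    mu=[[0.0, 2.0], [0.0, 0.0]], gamma_b=[0.0, 0.0], sigma2_b=[1.0, 1.0],
)
rng = random.Random(1)
z_new = rng.choices(range(len(probs)), weights=probs)[0]

# A subtype effect at zero favors the spike; one far from zero favors the slab.
p_near_zero = inclusion_probability(0.0, 0.5, 100.0, 0.01)
p_far = inclusion_probability(3.0, 0.5, 100.0, 0.01)
```

The remaining steps are draws from standard Beta, Dirichlet, normal, and inverse-Gamma distributions and follow the same pattern of plugging in the most recent values of the other variables.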

To determine the number of iterations for the Gibbs sampler, we adopt the estimated potential scale reduction (EPSR) factor criterion (Gelman et al. 2014) (see supplementary materials Section S1). Based on the collected samples from the Gibbs sampler, we conduct posterior inferences. For the underlying subtype effects (μ_gk’s), location batch effects (γ_bg’s), and scale batch effects (σ²_bg’s), which take continuous values, we use the means of their posterior samples for estimation, since the posterior mean minimizes the Bayes risk under squared error loss (Casella and Berger 2002). Regarding clustering, we take the posterior mode of the samples for Z_bj as the subtype for sample j in batch b.

Compared to single-batch-based methods, BUS can borrow information across all batches and all genes. For example, in step 7 of the Gibbs sampler, updating subtype effects μ^[t]_gk depends on data related to gene g and subtype k in all of the batches, which offers more robust and accurate estimation. On the other hand, in step 5, the subtype determination for a given sample uses information across the genes. The two-way information sharing across genes and batches improves the statistical power of BUS.

Recall that gene g is an intrinsic gene if L_gk = 1 for some k (2 ⩽ k ⩽ K). To reduce the errors in inferring the L_gk’s, we control the Bayesian false discovery rate (FDR; Newton et al. 2004; Peterson et al. 2015). We denote by ${PPI}_{g k}$ the posterior marginal probability for gene g to be DE in subtype k compared to subtype one and let $ξ_{g k} = 1 - {PPI}_{g k}$ ; then according to Newton et al. (2004) and Peterson et al. (2015), the expected Bayesian FDR for inferring DE indicators becomes $FDR (κ) = \frac{\sum_{g = 1}^{G} \sum_{k = 2}^{K} ξ_{g k} I (ξ_{g k} \leq κ)}{\sum_{g = 1}^{G} \sum_{k = 2}^{K} I (ξ_{g k} \leq κ)} .$ (4.1) From the posterior samples of L_gk, $\{L_{g k}^{(t)} : t = N_{burn-in} + 1, \dots, N_{total}\}$ , we can estimate ${PPI}_{g k} = P (L_{g k} = 1 | Y) \approx \frac{1}{N_{total} - N_{burn-in}} \sum_{t = N_{burn-in} + 1}^{N_{total}} L_{g k}^{(t)}$ and further approximate $FDR (κ)$ . If we want to control the FDR at a prespecified threshold α, such as 0.1, we can select κ₀ such that the estimated $FDR (κ_{0}) \leq α$ . In other words, if the estimated ${PPI}_{g k} \geq 1 - κ_{0}$ , we claim L_gk as one and zero otherwise.
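The thresholding rule in Equation (4.1) can be sketched as follows. The sketch picks the largest κ whose estimated FDR stays at or below the target, which is one natural reading of "select κ₀" and is an assumption of this example, as are the function name `call_de` and the toy PPI values.

```python
def call_de(ppi, target_fdr):
    """Return (kappa, flags) controlling the estimated Bayesian FDR.

    ppi        : flat list of estimated PPI_gk values
    target_fdr : prespecified FDR level, for example 0.1
    kappa is None and no indicator is called when no threshold qualifies.
    """
    xi = sorted(1.0 - p for p in ppi)
    kappa = None
    running = 0.0
    for i, value in enumerate(xi):
        running += value
        # FDR(kappa) averages over every xi <= kappa, so a threshold is only
        # evaluated at the last member of a group of tied values.
        last_of_ties = i + 1 == len(xi) or xi[i + 1] > value
        if last_of_ties and running / (i + 1) <= target_fdr:
            kappa = value
    if kappa is None:
        return None, [False] * len(ppi)
    return kappa, [1.0 - p <= kappa for p in ppi]

# Five hypothetical (g, k) pairs.
kappa0, flags = call_de([0.99, 0.95, 0.9, 0.6, 0.2], target_fdr=0.1)
```

Here the three most confident pairs are called DE: their average ξ is about 0.053, whereas adding the fourth pair would raise the estimated FDR to 0.14.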

So far, we have focused on the case where investigators have prior knowledge of the number of subtypes. When K is unknown, the Bayesian information criterion (BIC; Schwarz et al. 1978) can be used to select the optimal number of subtypes. The BIC for BUS is $- 2 \cdot [\sum_{b = 1}^{B} \sum_{j = 1}^{n_{b}} log (\sum_{k = 1}^{K} {\hat{π}}_{b k} \prod_{g = 1}^{G} N (y_{b j g}; {\hat{α}}_{g} + {\hat{μ}}_{g k} + {\hat{γ}}_{b g}, {\hat{σ}}_{b g}^{2}))]$ (4.2) $+ (K G + (2 B - 1) G) \cdot log (\sum_{b = 1}^{B} n_{b} \cdot G),$ (4.3) where the first term (4.2) approximates two times the negative observed-data log likelihood and the second term (4.3) is the product of the number of parameters and the logarithm of the number of observations. ${\hat{π}}_{b k}$ , ${\hat{α}}_{g}$ , ${\hat{μ}}_{g k}$ , ${\hat{γ}}_{b g}$ , and ${\hat{σ}}_{b g}^{2}$ are the posterior mean estimates. We choose the subtype number K such that the BIC attains its minimum. Once the optimal K is determined, all of the aforementioned analyses follow.
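A minimal sketch of the BIC in (4.2) and (4.3) is given below, assuming the posterior mean estimates are already available as nested lists. The function name `bus_bic` and the data layout are illustrative assumptions. The mixture sum inside the logarithm is evaluated with the log-sum-exp trick to avoid underflow.

```python
import math

def bus_bic(y, pi, alpha, mu, gamma, sigma2):
    """BIC of a fitted BUS model: terms (4.2) plus (4.3).

    y[b][j][g] are the data; pi[b][k], alpha[g], mu[g][k], gamma[b][g], and
    sigma2[b][g] are plug-in estimates (posterior means in the text).
    """
    num_batches = len(y)
    num_genes = len(alpha)
    num_subtypes = len(pi[0])
    log_lik = 0.0
    n_total = 0
    for b, batch in enumerate(y):
        n_total += len(batch)
        for row in batch:
            comps = []
            for k in range(num_subtypes):
                if pi[b][k] == 0.0:
                    continue  # subtype absent from this batch
                s = math.log(pi[b][k])
                for g, value in enumerate(row):
                    var = sigma2[b][g]
                    resid = value - alpha[g] - mu[g][k] - gamma[b][g]
                    s += -0.5 * math.log(2.0 * math.pi * var)
                    s -= resid * resid / (2.0 * var)
                comps.append(s)
            top = max(comps)
            log_lik += top + math.log(sum(math.exp(c - top) for c in comps))
    # Parameter count used in (4.3): K*G + (2B - 1)*G.
    num_params = num_subtypes * num_genes + (2 * num_batches - 1) * num_genes
    return -2.0 * log_lik + num_params * math.log(n_total * num_genes)

# One batch, one sample, one gene, one subtype: the penalty vanishes
# because log(1 * 1) = 0, leaving only the likelihood term.
bic = bus_bic(
    y=[[[2.0]]], pi=[[1.0]], alpha=[2.0],
    mu=[[0.0]], gamma=[[0.0]], sigma2=[[1.0]],
)
```

In practice one would fit the model for each candidate K, evaluate this quantity for every fit, and keep the K with the smallest value.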

Compared to MetaSparseKmeans (Huo et al. 2016), BUS has two advantages for clustering. First, BUS allows only some of the total K subtypes to appear in each batch. In contrast, MetaSparseKmeans requires all subtypes to be measured on every batch, which imposes very strong constraints on the datasets that can be used for meta-analysis and is often infeasible for many experimental designs. Second, BUS avoids the combinatorial matching encountered by MetaSparseKmeans. The computational complexity of BUS for every round of updates is only O(∑^B_{b = 1}n_b · GK), linear both in the number of batches B and in the number of subtypes K, thus significantly outperforming the computational complexity of MetaSparseKmeans, O((K!)^{B − 1}), which grows exponentially in the number of batches B and factorially in the number of subtypes K.
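The two growth rates can be compared with a few lines of arithmetic. The function names below are hypothetical, and the second function returns only the order of work in one sweep, ignoring constants, so the two numbers are not directly comparable as running times.

```python
import math

def matching_search_size(num_subtypes, num_batches):
    """Number of label matchings in an exhaustive search: (K!)^(B - 1)."""
    return math.factorial(num_subtypes) ** (num_batches - 1)

def bus_sweep_order(batch_sizes, num_genes, num_subtypes):
    """Order of work for one round of BUS updates: (sum of n_b) * G * K."""
    return sum(batch_sizes) * num_genes * num_subtypes

# K = 5 subtypes and B = 5 batches, the case quoted for MetaSparseKmeans.
comparisons = matching_search_size(5, 5)

# Adding one batch multiplies the search size by another factor of K!.
comparisons_six_batches = matching_search_size(5, 6)

# Batch sizes and gene count taken from Simulation I (K = 3, B = 3).
sweep = bus_sweep_order([100, 110, 120], 10_000, 3)
```

With K = 5 and B = 5 the search size is 120⁴ = 207,360,000, which matches the 207.36 million comparisons quoted in the Introduction.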

4.3. Downstream Analysis

In addition to clustering, BUS also provides an explicit characterization of batch effects and enables correction of the raw input data. To correct batch effects, we can follow an approach similar to the L/S model (Johnson et al. 2007). As batch one is always taken as the reference batch, no correction is needed for it. For batches two to B, the corrected gene expression value ${\hat{y}}_{b j g}$ after removing the “location” and “scale” batch effects can be calculated as

${\hat{y}}_{b j g} = {\hat{α}}_{g} + {\hat{μ}}_{g {\hat{Z}}_{b j}} + \frac{y_{b j g} - {\hat{α}}_{g} - {\hat{μ}}_{g {\hat{Z}}_{b j}} - {\hat{γ}}_{b g}}{{\hat{σ}}_{b g} / {\hat{σ}}_{1 g}} .$ (4.4)

The corrected expression values will be free from nonbiological effects and serve as valid data sources for downstream analysis such as differential gene expression detection and gene regulatory network construction.
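Equation (4.4) translates directly into code. The sketch below is a minimal Python illustration, assuming the estimates are stored as nested lists with batch one at index 0; the function names and the data layout are hypothetical.

```python
import math


def correct_value(y, alpha_g, mu_gk, gamma_bg, sigma2_bg, sigma2_1g):
    """Apply Equation (4.4) to a single expression value.

    The residual around the fitted mean is rescaled from the standard
    deviation of batch b to that of the reference batch (batch one).
    """
    scale = math.sqrt(sigma2_bg) / math.sqrt(sigma2_1g)
    return alpha_g + mu_gk + (y - alpha_g - mu_gk - gamma_bg) / scale


def correct_batches(y, z_hat, alpha, mu, gamma, sigma2):
    """Correct all values; y[b][j][g], z_hat[b][j], mu[g][k], gamma[b][g].

    Index 0 is the reference batch, so gamma[0][g] is assumed to be zero
    and the reference batch is returned essentially unchanged.
    """
    corrected = []
    for b, batch in enumerate(y):
        rows = []
        for j, sample in enumerate(batch):
            k = z_hat[b][j]  # estimated subtype of sample j in batch b
            rows.append([
                correct_value(sample[g], alpha[g], mu[g][k],
                              gamma[b][g], sigma2[b][g], sigma2[0][g])
                for g in range(len(alpha))
            ])
        corrected.append(rows)
    return corrected
```

For example, a value that sits exactly at its fitted batch-specific mean is mapped to the batch-free mean ${\hat{α}}_{g} + {\hat{μ}}_{g k}$, and deviations from that mean are shrunk or stretched by the ratio of standard deviations.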

5. Simulation

In this section, we evaluate the performance of BUS in correcting batch effects, clustering subtypes, and selecting intrinsic genes via simulation studies. We compare BUS to MetaSparseKmeans (Huo et al. 2016) and a two-stage approach coupling ComBat (Johnson et al. 2007) with SparseKmeans (Witten and Tibshirani 2012).

5.1. With Complete Subtypes

Following the terminology in Section 3, we first investigate the “complete subtypes” case. In Simulation I, we simulate expression levels for G = 10,000 genes from K = 3 subtypes measured in B = 3 batches. The sample sizes of the batches are (n₁, n₂, n₃) = (100, 110, 120). The subtype proportions of the batches are $π_{1} = (0.2, 0.2, 0.6)$ , $π_{2} = (0.1, 0.8, 0.1),$ and $π_{3} = (0.6, 0.1, 0.3)$ , respectively. For each batch b, we assume that the first π_b1 · 100% of the simulated samples are from subtype one, the next π_b2 · 100% are from subtype two, and the rest are from subtype three.

We then specify the underlying mean gene expression level. α_g is set to two for all of the genes. Recall that the subtype effect μ_g1 is constrained to zero for all genes in subtype one. The first 500 genes are chosen to be DE in subtype two compared to subtype one with a mean shift of two: μ_g2 = 2, g = 1, …, 500. Similarly, genes 501–1000 are up-regulated in subtype three with a mean difference of two: μ_g3 = 2, g = 501, …, 1000. None of the remaining genes are DE. In other words, μ_g2 = 0, g = 501, …, 1000; μ_g3 = 0, g = 1, …, 500; and μ_gk = 0, k = 2, 3 for the rest of the genes. Figure 1(a) shows the underlying true gene expression level α_g + μ_gk for each gene in each subtype.

Figure 1. Patterns for Simulation I. (a) True subtype mean. Each row represents a gene, and each column corresponds to a subtype. There are in total 10,000 genes and three subtypes. (b) True batch effects. Each row represents a gene and each column corresponds to a batch. There are three batches in total. (c) Observed gene expression. Each row represents a gene and each column is a sample. There are 330 samples. (d) BIC plot. (e) Estimated subtype mean. (f) Estimated batch effects. (g) Corrected gene expression grouped by batches. The samples are first ordered by batch (the top bar) and then ordered by subtype (the bottom bar). (h) Corrected gene expression grouped by subtypes. The samples are first ordered by subtype (the bottom bar) and then ordered by batch (the top bar).

Now, we determine the batch effects. For the additive batch effects in batch two, as illustrated in Figure 1(b), we set γ_2g = 3 for genes 1–2000; γ_2g = 2 for genes 2001–4000; γ_2g = 1 for genes 4001–6000; γ_2g = 2 for genes 6001–8000; and γ_2g = 3 for genes 8001–10,000. If we abbreviate such a pattern over five consecutive blocks of 2000 genes as (3,2,1,2,3), then γ_3g is specified as (1,2,3,2,1) in the same fashion for batch three. The multiplicative batch effects are σ²_1g = 0.1, σ²_2g = 0.2, and σ²_3g = 0.15 for all genes.

Finally, we generate all of the raw gene expression values from Model (2.5):

$\begin{matrix} Z_{b j} \sim Multinomial (1; π_{b 1}, \dots, π_{b K}); \\ Y_{b j g} \sim N (m_{b g k}, σ_{b g}^{2}) | Z_{b j} = k; \\ m_{b g k} = α_{g} + X_{b j} μ_{g} + γ_{b g} = α_{g} + μ_{g k} + γ_{b g} | Z_{b j} = k . \end{matrix}$ (2.5)

Figure 1(c) displays the observed data. Clearly, a naive application of traditional clustering methods to the raw data, without considering batch effects, would fail to identify the true subtypes.
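The Simulation I design can be reproduced in outline with the Python sketch below. It is an illustration under stated assumptions rather than the authors' simulation code: the function name, the returned dictionary, the random seed, and the option to shrink the gene count (with the 500-gene and 2000-gene blocks scaled proportionally) are all hypothetical conveniences, and the additive batch effect of the reference batch is taken to be zero.

```python
import math
import random


def simulate_simulation_one(n_genes=10000, seed=0):
    """Draw data following the Simulation I settings described above."""
    if n_genes % 20 != 0:
        raise ValueError("n_genes must be a multiple of 20")
    rng = random.Random(seed)
    batch_sizes = [100, 110, 120]
    proportions = [(0.2, 0.2, 0.6), (0.1, 0.8, 0.1), (0.6, 0.1, 0.3)]
    sigma2 = [0.1, 0.2, 0.15]  # one variance per batch, shared by all genes
    n_de = n_genes // 20       # 500 genes when n_genes = 10,000
    block = n_genes // 5       # 2000 genes when n_genes = 10,000
    alpha = [2.0] * n_genes
    # mu[g][k]: subtype one is the baseline; subtypes two and three each
    # shift their own block of genes by two.
    mu = [[0.0,
           2.0 if g < n_de else 0.0,
           2.0 if n_de <= g < 2 * n_de else 0.0] for g in range(n_genes)]
    # Additive batch effects per block; batch one is the reference (assumed 0).
    patterns = [(0, 0, 0, 0, 0), (3, 2, 1, 2, 3), (1, 2, 3, 2, 1)]
    gamma = [[float(p[g // block]) for g in range(n_genes)] for p in patterns]
    y, z = [], []
    for b, n_b in enumerate(batch_sizes):
        n1 = round(n_b * proportions[b][0])
        n2 = round(n_b * proportions[b][1])
        labels = [0] * n1 + [1] * n2 + [2] * (n_b - n1 - n2)
        sd = math.sqrt(sigma2[b])
        y.append([[rng.gauss(alpha[g] + mu[g][k] + gamma[b][g], sd)
                   for g in range(n_genes)] for k in labels])
        z.append(labels)
    return {"y": y, "z": z, "alpha": alpha, "mu": mu,
            "gamma": gamma, "sigma2": sigma2}
```

Calling `simulate_simulation_one(n_genes=100)` gives a quick small-scale version with the same batch sizes and subtype proportions, which is convenient for checking downstream code before running the full 10,000-gene setting.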

For the prior distributions discussed in Section 4.1, we set the hyper-parameters as follows: m = 1, $τ_{m} = \sqrt{5}$ , $τ_{γ} = \sqrt{5}$ , α = 2, $\tilde{a} = 2$ , $\tilde{b} = 1$ , a_τ = 2, b_τ = 0.005, (a_p, b_p) = (1, 3), and τ_μ1 = 10. We conduct the analysis for K from 2 to 10. BIC correctly chooses K = 3 (Figure 1(d)). The Markov chain produced by the Gibbs sampler converges very quickly according to the EPSR factors (see supplementary materials Section S1 for details and supplementary Figure S1 for trace plots). After a burn-in period of 150 iterations, we collect 150 iterations for posterior inference. Figure 1(e) and 1(f) show the posterior means of the subtype means and the additive batch effects. A comparison with Figure 1(a) and 1(b) shows that BUS accurately estimates the true parameters. Figure 1(g) and 1(h) are the heatmaps of the batch-effects-corrected ${\hat{y}}_{b j g}$ ; the corrected gene expression values no longer suffer from technical noise and reveal their true biological variation.

Besides correcting batch effects, BUS can cluster all samples automatically with the posterior modes of their subtype indicators $\hat{Z} = \{ {\hat{Z}}_{b j} : 1 \leq b \leq B, 1 \leq j \leq n_{b} \}$ . Samples with the same value of ${\hat{Z}}_{b j}$ are grouped together. To measure how close the estimates $\hat{Z}$ are to the underlying truth Z, we adopt the adjusted Rand index (ARI; Hubert and Arabie 1985). The ARI is bounded above by one, and the higher the value is, the more consistent the two categorical vectors are. The ARI between $\hat{Z}$ and Z is exactly one, showing that BUS perfectly recovers the underlying true subtypes for all samples.
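For readers who want to reproduce this comparison, the adjusted Rand index can be computed from the contingency counts of the two labelings. The following is a minimal standard-library Python sketch of the usual Hubert and Arabie formula; the function name is illustrative, and returning one for the degenerate case where the index is undefined is a convention chosen here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two labelings of equal length >= 2")
    n = len(labels_a)
    # Pairs of samples placed together in both, in the first, in the second.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = 0.5 * (together_a + together_b)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings use one cluster
    return (together_both - expected) / (maximum - expected)


# Label names do not matter, only the grouping does.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

Because the index depends only on which samples are grouped together, a relabeling of the estimated subtypes does not change its value.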

Next, we study the performance of BUS in identifying intrinsic genes that reflect the differences among subtypes. The intrinsic gene indicator D_g is estimated as ${\hat{D}}_{g} = \sum_{k = 2}^{K} {\hat{L}}_{g k}$ , where $\{ {\hat{L}}_{g k} : 1 \leq g \leq G, 2 \leq k \leq K \}$ are estimated by setting κ = 0.5, which keeps the FDR in Equation (4.1),

$FDR (κ) = \frac{\sum_{g = 1}^{G} \sum_{k = 2}^{K} ξ_{g k} I (ξ_{g k} \leq κ)}{\sum_{g = 1}^{G} \sum_{k = 2}^{K} I (ξ_{g k} \leq κ)} ,$ (4.1)

below 0.1. Genes with ${\hat{D}}_{g} > 0$ are regarded as intrinsic genes. BUS correctly identifies all of the 1000 true intrinsic genes, with only one false discovery.
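The thresholding in Equation (4.1) can be sketched as follows. This Python illustration assumes that ξ_gk is the posterior probability that gene g is not differentially expressed in subtype k, so that a pair (g, k) is declared a discovery when ξ_gk ≤ κ; that reading, the function names, the cap of 0.5 on κ as a default, and the choice of returning `None` when no threshold qualifies are assumptions made for the example.

```python
def bayesian_fdr(xi, kappa):
    """Estimated FDR of Equation (4.1) for a flat list of xi values."""
    selected = [x for x in xi if x <= kappa]
    if not selected:
        return 0.0  # nothing is declared, so there are no false discoveries
    return sum(selected) / len(selected)


def choose_kappa(xi, target=0.1, max_kappa=0.5):
    """Largest kappa <= max_kappa whose estimated FDR is at most target."""
    if bayesian_fdr(xi, max_kappa) <= target:
        return max_kappa
    best = None
    for kappa in sorted(set(x for x in xi if x <= max_kappa)):
        if bayesian_fdr(xi, kappa) <= target:
            best = kappa
    return best


def intrinsic_indicators(xi_by_gene, kappa):
    """D_g = number of subtypes k >= 2 with xi[g][k] <= kappa, per gene."""
    return [sum(1 for x in row if x <= kappa) for row in xi_by_gene]


xi = [0.01, 0.02, 0.05, 0.3, 0.45, 0.9]
print(choose_kappa(xi))  # 0.3
print(intrinsic_indicators([[0.01, 0.9], [0.6, 0.7]], 0.5))  # [1, 0]
```

The estimated FDR is the average of the selected ξ values, so it can only increase as κ grows; the search therefore keeps the last candidate that still meets the target.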

In comparison, for this small-scale dataset, MetaSparseKmeans also attains an ARI of one and identifies all of the intrinsic genes correctly. MetaSparseKmeans and BUS have comparable computation times here, around 2 min. It is noteworthy, however, that the Markov chain Monte Carlo (MCMC; Robert and Casella 2013) samples obtained by BUS provide the full posterior distribution for all of the parameters and thus quantify the uncertainty in the parameter estimates.

We also test a two-stage approach that we abbreviate as “ComSpa.” ComSpa first applies ComBat to the three batches, then treats the corrected values from all of the batches as a single batch and uses SparseKmeans to cluster samples. However, even after the correction by ComBat, the expression profiles of the same subtype show very distinct patterns in different batches. For example, in supplementary Figure S2, subtype two samples (colored blue in the lower color bar) on batch one show significantly higher expression values than subtype two samples on batch two for the first 500 intrinsic genes. The issue is that ComBat requires known subtype information or homogeneous biological samples to tease out batch effects. When multiple unknown subtypes exist within each batch, ComBat can easily over-correct by treating biological variation as artificial noise. The problematic batch effects correction in the first stage leads to the poor clustering performance of ComSpa, with a low ARI of 0.546.

5.1.1. Sensitivity Analysis

We conduct sensitivity analysis to check the influence of prior distributions and find that the choices of hyper-parameters have little effect on the posterior inference (see supplementary materials Section S2.1 and supplementary Figure S3), so we fix the same set of hyperparameters as in Simulation I throughout the article.

In addition to the choices of hyper-parameters, we investigate the effects of sample misclassification on batch effects estimation (see supplementary materials Section S2.2). According to supplementary Figure S4, the batch effects estimates are not sensitive to sample misclassification for intrinsic genes and are always accurate for nonintrinsic genes. Therefore, even in cases where some subtypes are wrongly assigned, Equation (4.4) still provides robust corrected expression values.

5.1.2. Varied Signal Strengths

We further test the robustness of BUS under different signal strengths. Notice that there are two ways to vary signals: (a) keep the sample size fixed and change the strengths of the subtype effects and (b) fix the strengths of the subtype effects and vary the sample size.

For scenario (a), we fix the sample sizes at (100, 110, 120) for the three batches, respectively. For each gene g, we set σ²_1g = 0.1 on batch one, σ²_2g = 0.2 on batch two, and σ²_3g = 0.15 on batch three. We represent the signal matrix that encodes the subtype effects as S(v). Each column of S(v) corresponds to a subtype, and each row refers to a gene. Specifically, the kth column of S(v), S(v)_k = (μ_1k, …, μ_Gk)^T, represents the subtype effects of all of the genes for subtype k. Consequently, v indicates the signal strength of the subtype effects: the lower v is, the weaker the signals are. We vary v between 0 and 3, taking values in {0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 1, 1.5, 2, 2.5, 3}. For each setting, we apply BUS and calculate the corresponding ARI (see supplementary Figure S5(a)). When v < 0.15, the signals are overwhelmed by noise. However, as long as v ⩾ 0.4, which is about the size of ${max}_{b} σ_{b g} = \sqrt{0.2} \approx 0.45$ , BUS can precisely cluster all of the samples and infer the underlying parameters.

For scenario (b), keeping the subtype effect strength at v = 2 and the other settings the same as above, we reduce the sample sizes from (100, 110, 120) to (100, 110, 120) · ω for ω = 0.1, 0.2, …, 0.8, 0.9, 1. For each generated dataset, we apply BUS and calculate the corresponding ARI (see supplementary Figure S5(b)). As long as ω ⩾ 0.2, even with sample sizes as small as (20, 22, 24) for ω = 0.2, BUS is able to correctly cluster all of the samples and estimate the parameters precisely.

5.1.3. Model Misspecification

BUS assumes that, given a subtype, the expression levels of different genes are independent, which leads to a diagonal covariance matrix $Σ$ in Equation (2.3), $Y_{j} \sim π_{1} N (m_{1}, Σ) + π_{2} N (m_{2}, Σ) + \dots + π_{K} N (m_{K}, Σ)$ , iid for $j = 1, \dots, n$ . However, as is often the case, the expression levels of many genes are correlated, so $Σ$ is no longer diagonal. We incorporate correlation structures into gene expression following a simulation setting similar to that of Huo et al. (2016), assuming that the expression values for every 20 consecutive intrinsic genes follow a multivariate normal distribution with the covariance matrix sampled from an inverse Wishart distribution $W^{- 1} (Φ, 60)$ , where $Φ = 0.5 I_{20 \times 20} + 0.5 \cdot 1_{20 \times 1} 1_{20 \times 1}^{T}$ . The expression values of the remaining genes, that is, the nonintrinsic genes, are independent. We apply BUS to this model-misspecified setting, and it turns out that BUS successfully identifies the true subtypes for all of the samples ( $ARI = 1$ ) and estimates subtype effects and location batch effects very well (see supplementary materials Section S3.1 and supplementary Figure S6).
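As a small worked illustration of the scale matrix above, its entries can be written out explicitly. The second line is a sketch under an assumption: it uses the common parameterization in which an inverse Wishart distribution with scale matrix Φ, ν degrees of freedom, and dimension p has mean Φ / (ν − p − 1) for ν > p + 1. The article does not state which parameterization it uses.

$Φ_{i i} = 0.5 + 0.5 = 1, \quad Φ_{i j} = 0.5 \quad (i \neq j),$

$E [Σ_{block}] = \frac{Φ}{60 - 20 - 1} = \frac{Φ}{39}, \quad \text{so on average } Var \approx 0.026 \text{ and } Cov \approx 0.013 .$

Under that assumption, genes within a block of 20 have an off-diagonal to diagonal ratio of 0.5 in the expected covariance matrix, which is a substantial departure from the diagonal covariance assumed by BUS.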

In real-world datasets, some samples might be pure noise or outliers that cannot be grouped into any cluster. Such samples are called “scattered samples” (Tseng and Wong 2005; Maitra and Ramler 2009). Although Tseng and Wong (2005) proposed their algorithm under the K-means framework, in principle their resampling ideas generalize to the detection of scattered samples for other clustering methods. We adapt algorithm A of Tseng and Wong (2005) to BUS and propose an algorithm to detect scattered samples for a given cluster number (see supplementary materials Section S3.2). To illustrate, we add one scattered sample to each batch in Simulation I. In batch b, the expression level of gene g in the scattered sample is drawn from N(4 + γ_bg, σ²_bg). We then carry out the “scattered sample-finding” algorithm. As shown in supplementary Figure S7(a)–S7(c), the scattered sample is successfully identified in each batch.

All in all, BUS is robust to the choices of hyper-parameters, sample misclassification, noise levels, correlation structures, and scattered samples.

For the “complete subtypes” setting, we also test a large-scale dataset, Simulation II, which consists of 10 batches and 5 subtypes. BUS performs equally well: it takes 1.32 hr on a 2.6 GHz processor (the same processor used across the article) and attains zero FDR (κ = 0.5) and an ARI of one, whereas the exhaustive version of the original MetaSparseKmeans algorithm fails to run. Therefore, we run an approximate version of MetaSparseKmeans that uses simulated annealing (MetaSparseKmeans-SA), which takes 1.73 hr and also has an ARI of one. Thus, MetaSparseKmeans-SA has performance comparable to BUS on Simulation II. Nevertheless, as discussed above, BUS can provide the full posterior distribution for all of the parameters. Please see supplementary materials Section S4 and supplementary Table S1 for the details of Simulation II.

5.2. With Missing Subtypes

Now we move on to the scenario “with missing subtypes” as defined in Section 3. We consider a large-scale case, Simulation III, in which there are 10 batches. The sample sizes of the 10 batches are set to (300, 110, 220, 100, 180, 110, 100, 180, 150, 110), respectively. The total number of subtypes across all samples, K, is set to 5. However, the number of subtypes in batch b, denoted by K_b ⩽ K, varies across batches. For example, subtype one appears only in batches one and eight, and subtype five is present only in batches one and two. Supplementary Table S2 shows the values of all of the parameters, and Figure 2(a)–(c) depicts the assumed subtype means, the batch effects, and the observed data, respectively.

Figure 2. Patterns for Simulation III. (a) True subtype mean. Each row represents a gene, and each column corresponds to a subtype. There are in total 10,000 genes and five subtypes. (b) True batch effects. Each row represents a gene and each column corresponds to a batch. There are 10 batches in total. (c) Observed gene expression. Each row represents a gene and each column is a sample. There are 1560 samples. (d) BIC plot. (e) Estimated subtype mean. (f) Estimated batch effects. (g) Corrected gene expression grouped by batches. The samples are first ordered by batch (the top bar) and then ordered by subtype (the bottom bar). (h) Corrected gene expression grouped by subtypes. The samples are first ordered by subtype (the bottom bar) and then ordered by batch (the top bar).

BIC values (Figure 2(d)) are calculated for K = 2, 3, …, 10, and the optimal subtype number is five, in accord with the underlying truth. We conduct the same posterior inference as before: we run 3000 MCMC iterations, which achieve convergence based on the EPSR factors, and collect samples from the last 1500 iterations. The Gibbs sampler takes 1.73 hr.
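The post-processing of the Gibbs output described here, discarding the burn-in and then summarizing the retained draws by posterior means for continuous parameters and posterior modes for subtype indicators, can be sketched as below. The function names and the representation of a chain as a plain list of draws are illustrative assumptions.

```python
from collections import Counter


def posterior_mean(draws, burn_in):
    """Average of the draws of one scalar parameter after the burn-in."""
    kept = draws[burn_in:]
    if not kept:
        raise ValueError("no draws left after discarding the burn-in")
    return sum(kept) / len(kept)


def posterior_mode(label_draws, burn_in):
    """Most frequent subtype label of one sample after the burn-in."""
    kept = label_draws[burn_in:]
    if not kept:
        raise ValueError("no draws left after discarding the burn-in")
    return Counter(kept).most_common(1)[0][0]


# Toy chains of length 6 with a burn-in of 2.
print(posterior_mean([10.0, 8.0, 2.0, 4.0, 3.0, 3.0], burn_in=2))  # 3.0
print(posterior_mode([0, 0, 1, 1, 1, 2], burn_in=2))               # 1
```

With 3000 iterations and the last 1500 retained, `burn_in` would be 1500 in this notation.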

Subtype means (Figure 2(e)) and batch effects (Figure 2(f)) are correctly recovered once again. The corrected expression data are free from technical artifacts and demonstrate their underlying true biological variability (Figure 2(g) and 2(h)). The ARI between the results produced by BUS and the underlying truth is one again, which indicates that BUS clusters the samples perfectly. BUS also finds all of the intrinsic genes with no false discovery (κ = 0.5 in Equation (4.1)).

In comparison, we run MetaSparseKmeans-SA, SparseKmeans, and ComSpa with the prespecified total cluster number set to 5. MetaSparseKmeans-SA and ComSpa are applied to the whole dataset combined from the 10 batches, and SparseKmeans is applied to each batch separately. Finally, we calculate their ARIs with respect to the underlying truth. Across all samples from all batches, MetaSparseKmeans-SA has an ARI of 0.289 and takes 2.43 hr; ComSpa has an ARI of 0.43 and takes 13.3 hr; and SparseKmeans takes less than 1 hr but has an ARI of 0.193. All of these ARIs are much lower than the ARI of one achieved by BUS. The main issues are: (a) MetaSparseKmeans and SparseKmeans force all of the K subtypes to be present in every batch, ignoring the fact that K_b can be less than K in an individual batch and can vary across batches; and (b) ComBat, the first stage of ComSpa, cannot filter out batch effects correctly when there are unknown biological variations.

6. Application

In this section, we apply BUS to a breast cancer dataset that was preprocessed by Huo et al. (2016) and analyzed there with the MetaSparseKmeans model. The study design of the dataset is summarized in Table 1. In total, B = 3 batches are analyzed, and they come from two different platforms. The batch generated by The Cancer Genome Atlas (TCGA) (Cancer Genome Atlas Network 2012) was measured using the Agilent platform, and consequently, the measurement unit is log ratio intensity. The two batches generated by Wang et al. (2005) and Desmedt et al. (2007) used the Affymetrix array, and hence, the measurement unit is log intensity. From Table 1, we can see that the expression values from the two platforms have very different ranges and medians. Figure 3(a) is the heatmap of the raw gene expression values combined directly from the three batches. Not only do the batches measured on different platforms (batch 1 vs. batches 2 and 3) show distinct patterns due to artifacts, but the batches measured on the same platform (batch 2 vs. 3) also demonstrate strong batch effects. The subtype differences are completely overwhelmed by the batch effects.

Figure 3. Heatmaps for breast cancer datasets. (a) Observed gene expression, where rows represent 11,058 genes and columns represent 533 (TCGA)+260 (Wang et al.)+164 (Desmedt et al.) = 957 samples; (b) Corrected gene expression by BUS.

Table 1. Breast cancer datasets information. Part of the table is reproduced from Table 1 of Huo et al. (Citation2016).


According to BIC (see supplementary Figure S8), BUS identifies four subtypes, which is consistent with conclusions from the existing biomedical literature (Carey et al. 2006; Onitilo et al. 2009) that there are four main breast cancer subtypes: basal-like, HER2+/ER-, luminal A, and luminal B. However, to allow a direct comparison with MetaSparseKmeans (Huo et al. 2016), which specifies five subtypes for the same dataset, we deliberately fix K at five instead of using the BIC. We run the Gibbs sampler for 4000 iterations, which meets the convergence criterion based on the EPSR factors (see supplementary materials Section S1), treat the first 2000 iterations as burn-in, and keep the last 2000 iterations. The run takes 1.81 hr in total. Figure 4(a)–(c) plots the heatmaps of the gene expression values of each batch, with samples ordered by learned subtypes. Figure 3(b) shows the expression values corrected by BUS, and the heatmap illustrates that the corrected values can be viewed as being measured in the same batch.

Figure 4. Heatmaps for intrinsic genes selected by BUS from the breast cancer dataset. (a) Row-scaled gene expression values from TCGA. The rows are clustered by hierarchical clustering, and the columns are clustered based on the BUS model. (b) Row-scaled gene expression values from Wang et al. The row order is the same as that in (a), and columns are clustered based on the BUS model; (c) Row-scaled gene expression values from Desmedt et al. The row order is the same as that in (a), and columns are clustered based on the BUS model.

Now, for each subtype k and each gene g, we calculate ${\hat{L}}_{g k}$ by controlling the Bayesian FDR in Equation (4.1) at 0.1, which leads to κ = 0.4, and estimate ${\hat{D}}_{g}$ accordingly. As a result, 391 genes are selected as the intrinsic genes, which potentially explain the differences among breast cancer subtypes. In the heatmaps of the intrinsic genes (Figure 4), the subtype patterns are observable in each batch.

In contrast, 203 intrinsic genes are obtained using MetaSparseKmeans (Huo et al. 2016). We note that BUS associates each intrinsic gene g with the intrinsic gene indicator D_g ≔ ∑^K_{k = 2}L_gk > 0. Thus, genes with the highest D_g’s show the strongest signals of being intrinsic genes. Therefore, to select the same number of intrinsic genes as that identified by MetaSparseKmeans, we order the intrinsic genes by their D_g’s and choose the top 203 intrinsic genes to conduct pathway analysis on the same website http://software.broadinstitute.org/gsea/msigdb/annotate.jsp using the same BioCarta database. The top 203 intrinsic genes produced by BUS lead to 21 significant (q-value less than 0.05) BIOCARTA pathways (see the column “BUS 203” in Table 2). In contrast, the intrinsic gene set selected by MetaSparseKmeans on the same dataset gives seven significant BIOCARTA pathways. Only three enriched BIOCARTA pathways, BIOCARTA_RANMS_PATHWAY, BIOCARTA_MCM_PATHWAY, and BIOCARTA_G1_PATHWAY, overlap between these two intrinsic gene sets. However, the rest of the pathways enriched in the BUS analysis are closely related to breast cancer. In particular, the highest-ranked enriched pathway from the BUS analysis is the BIOCARTA_HER2_PATHWAY. Her2 is one of the most important biomarkers in breast cancer (Slamon et al. 1987). Indeed, overexpression of Her2 is strongly associated with increased disease recurrence and a poor prognosis (Xia et al. 2004). The subtype-specific therapy trastuzumab targets Her2 and is only effective in Her2-positive patients (Piccart-Gebhart et al. 2005). The efficacy of trastuzumab is a classic example of subtype-specific diagnosis and treatment. However, this very important breast cancer pathway, BIOCARTA_HER2_PATHWAY, is completely missed by MetaSparseKmeans. Moreover, IGF-1 (Wolf et al. 2008), MTA-3 (Fujita et al. 2003), and STATHMIN (Alli et al. 2007) are all key pathways for breast cancer. They are significant for BUS but missed by MetaSparseKmeans. These facts strongly support that the intrinsic gene set from BUS offers scientifically meaningful subtyping.
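The ranking step used to pick the top 203 genes can be sketched in Python as follows. The function name is illustrative, and because the indicator D_g is an integer count, ties are unavoidable; breaking them by original gene order is an assumption of this sketch, as the text does not say how ties are resolved.

```python
def top_intrinsic_genes(d_hat, n_top):
    """Indices of the n_top intrinsic genes with the largest indicator D_g.

    Only genes with D_g > 0 are candidates. The sort is stable, so ties
    keep their original gene order (an assumption of this sketch).
    """
    candidates = [g for g, d in enumerate(d_hat) if d > 0]
    candidates.sort(key=lambda g: -d_hat[g])
    return candidates[:n_top]


# Six genes, of which four are intrinsic; keep the three strongest.
print(top_intrinsic_genes([0, 2, 1, 3, 0, 1], 3))  # [3, 1, 2]
```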

Table 2. Significant (q-value less than 0.05) BIOCARTA pathways identified by different studies. Results from the joint analysis of three batches by BUS are compared to those from MetaSparseKmeans on three batches and sparse K-means on each batch separately. “BUS 391” indicates that we use all of the 391 intrinsic genes identified by BUS to carry out pathway analysis. “BUS 203” corresponds to using the top 203 genes with the highest intrinsic gene indicators learned by BUS. “-” indicates that the corresponding BIOCARTA pathway is not significant for the corresponding method. The numbers displayed in the table are q-values.


It is noteworthy that the two-stage approach ComSpa learns more than 4000 intrinsic genes for the same dataset. Such a large number of intrinsic genes is likely to include many false discoveries, and it exceeds the largest number of genes allowed for the pathway analysis on http://software.broadinstitute.org/gsea/msigdb/annotate.jsp.

We also use all 391 intrinsic genes called by BUS to conduct pathway analyses on the BioCarta (see the column “BUS 391” in Table 2), KEGG, and GO biological process databases (see supplementary Tables S3 to S5). For the KEGG database, there are 58 significant pathways, including the KEGG_ERBB_SIGNALING_PATHWAY (q-value: 6.02 × 10^{− 3}), which plays a crucial role in the development of breast cancer; KEGG_MAPK_SIGNALING_PATHWAY (q-value: 5.76 × 10^{− 5}) and KEGG_PHOSPHATIDYLINOSITOL_SIGNALING_SYSTEM (q-value: 0.0181), which are activated by most ErbB receptors; and the KEGG_PATHWAYS_IN_CANCER (q-value: 1.33 × 10^{− 11}). Meanwhile, more than 100 GO biological processes are significantly enriched. Most of them pertain to the cell cycle, cell proliferation, cell death, cell differentiation, and cell communication, all of which are potential factors related to the induction of breast cancer.

The pathway analyses show convincingly that BUS provides biologically and clinically valid subtyping and outperforms MetaSparseKmeans, let alone sparse K-means applied to each batch separately.

7. Discussion

To the best of our knowledge, BUS is the first method to explicitly model batch effects and discover underlying subtypes at the same time. Scientifically, BUS is able to integrate batches measured from different platforms in which subtypes can be present on some but not all batches. Statistically, BUS leverages information across batches to estimate subtype effects and borrows strength across genes to cluster subtypes, thus providing robust and legitimate inferences. Computationally, BUS overcomes the factorial growth of computation time, and its computational complexity only grows linearly with the batch number and subtype number. Moreover, BUS models the additive and multiplicative batch effects explicitly. Consequently, we can easily filter the two types of batch effects directly from the raw input data. The corrected data are robust to sample misclassification and can be used for downstream analysis as if they originate from a single batch.

Theoretically, we prove that BUS is identifiable when all subtypes are measured on each batch. In addition, we offer two very convenient experimental designs where subtypes are allowed to be measured on a subset of batches, and we prove the model identifiability under each scenario. We hope these results will provide researchers more freedom and flexibility in designing valid studies that can be protected from batch effects.

During the review period for this article, Jacob et al. (2016) proposed approaches to remove unwanted factors with unknown information of batches or biological phenotypes. However, their approaches require control genes whose expressions are known to be irrelevant to the factors of interest and/or control samples that have replicates in all batches, and their methods can work only when the control genes or control samples are available. Practically, incorporating control genes or control samples can be challenging. In contrast, BUS does not require any control genes, and the experimental designs suggested by Theorems 2 and 3 are more flexible than those that use control samples. BUS only asks for batch information. In real datasets, batch information can often be obtained directly or indirectly such as by tracking series numbers of the assayed arrays or sequencing experiments. Therefore, the requirement of known batch information is not very restrictive. In addition, as mentioned above, BUS explicitly models the batch effects and offers the capability of sample clustering.

BUS serves as a general framework for batch effects correction when unknown subtypes are present. It can be further tailored to adapt to the distributions from other types of high-throughput datasets such as DNA methylation microarrays, next-generation bulk sequencing data, and single-cell sequencing data. In BUS, we assume that given a batch the gene expression profile for a sample follows a mixture of multivariate normal distributions (see Equation (2.3)). In the same spirit, we can adopt a mixture of beta distributions (Ji et al. 2005) to model the DNA methylation values that are between zero and one, a mixture of multivariate Poisson distributions (Karlis and Meligkotsidou 2007) to fit RNA-seq count data, and a mixture of zero-inflated Poisson distributions to account for dropout events in single-cell RNA-seq experiments. Nevertheless, the exact implementations of BUS for each of the above data types, such as computation with nonconjugate priors, are beyond the scope of this article, and they will be our future research directions. For now, to apply the current version of BUS to RNA-seq data and DNA methylation data for convenience, we can first transform the data and then apply BUS to the transformed values. We provide an example of RNA-seq data as a proof of principle in the supplementary materials Section S5.

Given the statistical power and computational efficiency of BUS, we believe that BUS will become a powerful tool. On the one hand, its ability to correct batch effects will substantially facilitate preprocessing of the enormous amount of noisy, heterogeneous data in public databases and thus speed up mining of such rich data resources for valid scientific discoveries. On the other hand, its capability to identify subtypes will also help identify subgroups of patients. We provide BUS as a user-friendly free Bioconductor package BUScorrect, and we envision it will be widely adopted in the era of personalized medicine.

Supplementary Materials

The supplementary materials provide technical details and figures referred to in the article as well as the datasets used in the simulation studies and the real application.


References

Alli, E., Yang, J., and Hait, W. (2007), “Silencing of Stathmin Induces Tumor-Suppressor Function in Breast Cancer Cell Lines Harboring Mutant p53,” Oncogene, 26, 1003–1012.
Banfield, J. D., and Raftery, A. E. (1993), “Model-Based Gaussian and Non-Gaussian Clustering,” Biometrics, 49, 803–821.
Bickel, P. J., and Levina, E. (2004), “Some Theory for Fisher’s Linear Discriminant Function, ’Naive Bayes’, and Some Alternatives When there are Many More Variables than Observations,” Bernoulli, 10, 989–1010.
Carey, L. A., Perou, C. M., Livasy, C. A., Dressler, L. G., Cowan, D., Conway, K., Karaca, G., Troester, M. A., Tse, C. K., Edmiston, S., Deming, S. L., Geradts, J., Cheang, M. C. U., Nielsen, T. O., Moorman, P. G., Earp, H. S., and Millikan, R. C. (2006), “Race, Breast Cancer Subtypes, and Survival in the Carolina Breast Cancer Study,” Journal of the American Medical Association, 295, 2492–2502.
Casella, G., and Berger, R. L. (2002), Statistical Inference (Vol. 2), Pacific Grove, CA: Duxbury.
Chahrour, M., Jung, S. Y., Shaw, C., Zhou, X., Wong, S. T., Qin, J., and Zoghbi, H. Y. (2008), “Mecp2, A Key Contributor to Neurological Disease, Activates and Represses Transcription,” Science, 320, 1224–1229.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society, Series B, 39, 1–38.
Desmedt, C., Piette, F., Loi, S., Wang, Y., Lallemand, F., Haibe-Kains, B., Viale, G., Delorenzi, M., Zhang, Y., Saghatchian d'Assignies, M., Bergh, J., Lidereau, R., Ellis, P., Harris, A. L., Klijn, J. G. M., Foekens, J. A., Cardoso, F., Piccart, M. J., Buyse, M., and Sotiriou, C. (2007), “Strong Time Dependence of the 76-gene Prognostic Signature for Node-Negative Breast Cancer Patients in the Transbig Multicenter Independent Validation Series,” Clinical Cancer Research, 13, 3207–3214.
Edgar, R., Domrachev, M., and Lash, A. E. (2002), “Gene Expression Omnibus: NCBI Gene Expression and Hybridization Array Data Repository,” Nucleic Acids Research, 30, 207–210.
Fraley, C., and Raftery, A. E. (2002), “Model-Based Clustering, Discriminant Analysis, and Density Estimation,” Journal of the American Statistical Association, 97, 611–631.
Franks, A. M., Csárdi, G., Drummond, D. A., and Airoldi, E. M. (2015), “Estimating A Structured Covariance Matrix from Multilab Measurements in High-Throughput Biology,” Journal of the American Statistical Association, 110, 27–44.
Fujita, N., Jaye, D. L., Kajita, M., Geigerman, C., Moreno, C. S., and Wade, P. A. (2003), “Mta3, a Mi-2/NuRD Complex Subunit, Regulates An Invasive Growth Pathway in Breast Cancer,” Cell, 113, 207–219.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2014), Bayesian Data Analysis (Vol. 2), Boca Raton, FL: Chapman & Hall/CRC.
Geman, S., and Geman, D. (1984), “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6, 721–741.
George, E. I., and McCulloch, R. E. (1993), “Variable Selection via Gibbs Sampling,” Journal of the American Statistical Association, 88, 881–889.
Hein, A.-M. K., Richardson, S., Causton, H. C., Ambler, G. K., and Green, P. J. (2005), “Bgx: A Fully Bayesian Integrated Approach to the Analysis of Affymetrix GeneChip Data,” Biostatistics, 6, 349–373.
Hicks, S. C., Teng, M., and Irizarry, R. A. (2015), “On the Widespread and Critical Impact of Systematic Bias and Batch Effects in Single-Cell RNA-seq Data,” BioRxiv, 025528.
Hubert, L., and Arabie, P. (1985), “Comparing Partitions,” Journal of Classification, 2, 193–218.
Huo, Z., Ding, Y., Liu, S., Oesterreich, S., and Tseng, G. (2016), “Meta-analytic Framework for Sparse k-Means to Identify Disease Subtypes in Multiple Transcriptomic Studies,” Journal of the American Statistical Association, 111, 27–42.
Irizarry, R. A., Warren, D., Spencer, F., Kim, I. F., Biswal, S., Frank, B. C., Gabrielson, E., Garcia, J. G. N., Geoghegan, J., Germino, G., Griffin, C., Hilmer, S. C., Hoffman, E., Jedlicka, A. E., Kawasaki, E., Martínez-Murillo, F., Morsberger, L., Lee, H., Petersen, D., Quackenbush, J., Scott, A., Wilson, M., Yang, Y., Ye, S. Q., and Yu, W. (2005), “Multiple-Laboratory Comparison of Microarray Platforms,” Nature Methods, 2, 345–350.
Jacob, L., Gagnon-Bartsch, J. A., and Speed, T. P. (2016), “Correcting Gene Expression Data When Neither the Unwanted Variation Nor the Factor of Interest are Observed,” Biostatistics, 17, 16–28.
Ji, Y., Wu, C., Liu, P., Wang, J., and Coombes, K. R. (2005), “Applications of Beta-Mixture Models in Bioinformatics,” Bioinformatics, 21, 2118–2122.
Johnson, W. E., Li, C., and Rabinovic, A. (2007), “Adjusting Batch Effects in Microarray Expression Data Using Empirical Bayes Methods,” Biostatistics, 8, 118–127.
Karlis, D., and Meligkotsidou, L. (2007), “Finite Mixtures of Multivariate Poisson Distributions with Application,” Journal of Statistical Planning and Inference, 137, 1942–1960.
Leek, J. T. (2014), “svaseq: Removing Batch Effects and Other Unwanted Noise from Sequencing Data,” Nucleic Acids Research, gku864, e161.
Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., et al. (2010), “Tackling the Widespread and Critical Impact of Batch Effects in High-Throughput Data,” Nature Reviews Genetics, 11, 733–739.
Leek, J. T., and Storey, J. D. (2007), “Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis,” PLoS Genetics, 3, e161.
Maitra, R., and Ramler, I. P. (2009), “Clustering in the Presence of Scatter,” Biometrics, 65, 341–352.
McCall, M. N., Bolstad, B. M., and Irizarry, R. A. (2010), “Frozen Robust Multiarray Analysis (FRMA),” Biostatistics, 11, 242–253.
McLachlan, G., and Peel, D. (2004), Finite Mixture Models, New York: Wiley.
Newton, M. A., Noueiry, A., Sarkar, D., and Ahlquist, P. (2004), “Detecting Differential Gene Expression with a Semiparametric Hierarchical Mixture Method,” Biostatistics, 5, 155–176.
Onitilo, A. A., Engel, J. M., Greenlee, R. T., and Mukesh, B. N. (2009), “Breast Cancer Subtypes based on ER/PR and Her2 Expression: Comparison of Clinicopathologic Features and Survival,” Clinical Medicine & Research, 7, 4–13.
Pan, W., and Shen, X. (2007), “Penalized Model-Based Clustering with Application to Variable Selection,” The Journal of Machine Learning Research, 8, 1145–1164.
Peterson, C., Stingo, F. C., and Vannucci, M. (2015), “Bayesian Inference of Multiple Gaussian Graphical Models,” Journal of the American Statistical Association, 110, 159–174.
Piccart-Gebhart, M. J., Procter, M., Leyland-Jones, B., Goldhirsch, A., Untch, M., et al. (2005), “Trastuzumab after Adjuvant Chemotherapy in Her2-Positive Breast Cancer,” New England Journal of Medicine, 353, 1659–1672.
Piccolo, S. R., Sun, Y., Campbell, J. D., Lenburg, M. E., Bild, A. H., and Johnson, W. E. (2012), “A Single-Sample Microarray Normalization Method to Facilitate Personalized-Medicine Workflows,” Genomics, 100, 337–344.
Pickrell, J. K., Marioni, J. C., Pai, A. A., Degner, J. F., Engelhardt, B. E., Nkadori, E., Veyrieras, J.-B., Stephens, M., Gilad, Y., and Pritchard, J. K. (2010), “Understanding Mechanisms Underlying Human Gene Expression Variation with RNA Sequencing,” Nature, 464, 768–772.
Ritter, G. (2014), Robust Cluster Analysis and Variable Selection, Boca Raton, FL: CRC Press.
Robert, C., and Casella, G. (2013), Monte Carlo Statistical Methods, New York: Springer Science & Business Media.
Schwarz, G. (1978), “Estimating the Dimension of a Model,” The Annals of Statistics, 6, 461–464.
Slamon, D. J., Clark, G. M., Wong, S. G., Levin, W. J., Ullrich, A., and McGuire, W. L. (1987), “Human Breast Cancer: Correlation of Relapse and Survival with Amplification of the Her-2/Neu Oncogene,” Science, 235, 177–182.
Suárez-Fariñas, M., Shah, K. R., Haider, A. S., Krueger, J. G., and Lowes, M. A. (2010), “Personalized Medicine in Psoriasis: Developing a Genomic Classifier to Predict Histological Response to Alefacept,” BMC Dermatology, 10, 1–8.
Taub, M. A., Corrada Bravo, H., and Irizarry, R. A. (2010), “Overcoming Bias and Systematic Errors in Next Generation Sequencing Data,” Genome Medicine, 2, 87.
The Cancer Genome Atlas Network (2012), “Comprehensive Molecular Portraits of Human Breast Tumours,” Nature, 490, 61–70.
Tibshirani, R. (1996), “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society, Series B, 58, 267–288.
Tseng, G. C., and Wong, W. H. (2005), “Tight Clustering: A Resampling-Based Approach for Identifying Stable and Tight Patterns in Data,” Biometrics, 61, 10–16.
Wang, S., and Zhu, J. (2008), “Variable Selection for Model-Based High-Dimensional Clustering and its Application to Microarray Data,” Biometrics, 64, 440–448.
Wang, Y., Klijn, J. G., Zhang, Y., Sieuwerts, A. M., Look, M. P., Yang, F., Talantov, D., Timmermans, M., Meijer-van Gelder, M. E., Yu, J., Jatkoe, T., Berns, Els M. J. J., Atkins, D., and Foekens, J. A. (2005), “Gene-Expression Profiles to Predict Distant Metastasis of Lymph-Node-Negative Primary Breast Cancer,” The Lancet, 365, 671–679.
Witten, D. M., and Tibshirani, R. (2012), “A Framework for Feature Selection in Clustering,” Journal of the American Statistical Association, 105, 713–726.
Wolf, I., Levanon-Cohen, S., Bose, S., Ligumsky, H., Sredni, B., Kanety, H., Kuro-o, M., Karlan, B., Kaufman, B., Koeffler, H. P., and Rubinek, T. (2008), “Klotho: A Tumor Suppressor and A Modulator of the igf-1 and fgf Pathways in Human Breast Cancer,” Oncogene, 27, 7094–7105.
Xia, W., Chen, J.-S., Zhou, X., Sun, P.-R., Lee, D.-F., Liao, Y., Zhou, B. P., and Hung, M.-C. (2004), “Phosphorylation/Cytoplasmic Localization of p21cip1/waf1 is Associated with Her2/neu Overexpression and Provides a Novel Combination Predictor for Poor Prognosis in Breast Cancer Patients,” Clinical Cancer Research, 10, 3815–3824.
Yakowitz, S. J., and Spragins, J. D. (1968), “On the Identifiability of Finite Mixtures,” The Annals of Mathematical Statistics, 39, 209–214.
Yeung, K. Y., Fraley, C., Murua, A., Raftery, A. E., and Ruzzo, W. L. (2001), “Model-Based Clustering and Data Transformations for Gene Expression Data,” Bioinformatics, 17, 977–987.

Appendix: Proofs

A1. Theorem 1

Proof.

Write $Y = \{Y_{b}, 1 \leq b \leq B\}$. The marginal distribution $f (Y_{b} | Θ)$ reduces to a Gaussian mixture model on batch b. Therefore, the BUS model can be viewed as a combination of B Gaussian mixture models sharing the common parameters $α, μ_{1}, \dots, μ_{K}$ (with $μ_{1} = 0$). In other words, we can consider it as B Gaussian mixture models, each with parameter set $\{α^{(b)}, μ_{2}^{(b)}, \dots, μ_{K}^{(b)}, γ_{b}, σ_{b}\}$, under the constraints that $μ_{k}^{(1)} = \dots = μ_{k}^{(B)} = μ_{k}$ for each k and $α^{(1)} = \dots = α^{(B)} = α$.

Each Gaussian mixture model b is identifiable up to label switching (Yakowitz and Spragins 1968; Ritter 2014), so there exists a permutation $ρ^{(b)}$ of {1, 2, …, K} such that $π_{b k} = π_{b ρ^{(b)} (k)}^{*}$, $σ_{b} = σ_{b}^{*}$, and $α^{(b)} + μ_{k}^{(b)} + γ_{b} = α^{* (b)} + μ_{ρ^{(b)} (k)}^{* (b)} + γ_{b}^{*}$. Because $α^{(b)} = α$, $α^{* (b)} = α^{*}$, $μ_{k}^{(b)} = μ_{k}$, and $μ_{k}^{* (b)} = μ_{k}^{*}$, we have $α + μ_{k} + γ_{b} = α^{*} + μ_{ρ^{(b)} (k)}^{*} + γ_{b}^{*}$ (A.1). Recall that $μ_{1} = 0$; letting k = 1 in Equation (A.1) gives $α + γ_{b} = α^{*} + μ_{ρ^{(b)} (1)}^{*} + γ_{b}^{*}$ (A.2). Subtracting Equation (A.2) from Equation (A.1) leads to $μ_{k} = μ_{ρ^{(b)} (k)}^{*} - μ_{ρ^{(b)} (1)}^{*}$ for every b and k. In particular, $μ_{ρ^{(b)} (k)}^{*} - μ_{ρ^{(b)} (1)}^{*} = μ_{ρ^{(1)} (k)}^{*} - μ_{ρ^{(1)} (1)}^{*}$ for every b and k. By Assumption I, $ρ^{(b)} = ρ^{(1)}$ for each b. Denoting $ρ = ρ^{(1)} = \dots = ρ^{(B)}$, we have $π_{b k} = π_{b ρ (k)}^{*}$ and $α + μ_{k} + γ_{b} = α^{*} + μ_{ρ (k)}^{*} + γ_{b}^{*}$ for each b and k. Setting (k, b) = (1, 1) and then k = 1 in Equation (A.1), and using $γ_{1} = γ_{1}^{*} = 0$, gives $\begin{matrix} α & = & α^{*} + μ_{ρ (1)}^{*}, \\ α + γ_{b} & = & α^{*} + μ_{ρ (1)}^{*} + γ_{b}^{*} . \end{matrix}$ Consequently, we have $γ_{b} = γ_{b}^{*}$. Therefore, BUS is identifiable (up to label switching).
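The last step of the argument can be illustrated numerically. The following is a minimal sketch, not part of BUS or the BUScorrect package: it assumes a single feature, noise-free component means, and subtype labels that are already aligned across batches (that is, the permutation ρ has been resolved). The function names `cell_means` and `recover_parameters` are hypothetical. Under the constraints $μ_{1} = 0$ and $γ_{1} = 0$, the component means $α + μ_{k} + γ_{b}$ determine α, the subtype effects, and the batch effects uniquely.

```python
# Illustrative only: one feature, noise-free component means, labels already aligned.

def cell_means(alpha, mu, gamma):
    """Return m[b][k] = alpha + mu[k] + gamma[b] for every batch b and subtype k."""
    return [[alpha + mu_k + gamma_b for mu_k in mu] for gamma_b in gamma]


def recover_parameters(means):
    """Invert the map above under the constraints mu[0] == 0 and gamma[0] == 0."""
    alpha = means[0][0]                        # batch 1, subtype 1: alpha + 0 + 0
    mu = [m - alpha for m in means[0]]         # subtype effects read off batch 1
    gamma = [row[0] - alpha for row in means]  # batch effects read off subtype 1
    return alpha, mu, gamma


alpha_true = 2.0
mu_true = [0.0, 1.5, -0.75]     # mu[0] = 0: subtype 1 is the reference subtype
gamma_true = [0.0, 0.5, -1.25]  # gamma[0] = 0: batch 1 is the reference batch

means = cell_means(alpha_true, mu_true, gamma_true)
alpha_hat, mu_hat, gamma_hat = recover_parameters(means)
```

Without the two reference constraints the inversion would fail: adding a constant to α and subtracting it from every $γ_{b}$ leaves all component means unchanged.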

A2. Theorem 2

Proof.

We only need to prove that $ρ^{(b)} (k) = ρ^{(1)} (k)$ for each $k \in C_{b}$ (b ⩾ 2). Note that Equation (A.1) still holds for $k \in C_{b}$. Letting b = 1 in Equation (A.1), and since $γ_{1} = 0$ (and likewise $γ_{1}^{*} = 0$), we obtain $α + μ_{k} = α^{*} + μ_{ρ^{(1)} (k)}^{*}, \forall k \in C_{b}$ (A.3). Equation (A.1) minus Equation (A.3) gives $γ_{b} = μ_{ρ^{(b)} (k)}^{*} - μ_{ρ^{(1)} (k)}^{*} + γ_{b}^{*}$ for every $k \in C_{b}$. Therefore, for any distinct $k_{1}, k_{2} \in C_{b}$, we have $μ_{ρ^{(b)} (k_{2})}^{*} - μ_{ρ^{(b)} (k_{1})}^{*} = μ_{ρ^{(1)} (k_{2})}^{*} - μ_{ρ^{(1)} (k_{1})}^{*}$, which implies that $ρ^{(b)} (k) = ρ^{(1)} (k)$ for each $k \in C_{b}$ by Assumption I.

A3. Theorem 3

Proof.

Our goal is to prove that, given a batch b, for each $k \in ⋃_{i \neq b} (C_{b} \cap C_{i})$, assuming k also belongs to $C_{\tilde{b}}$, the equation $ρ^{(b)} (k) = ρ^{(\tilde{b})} (k)$ always holds. We separate the proof into two steps.

We first prove $ρ^{(b)} (k) = ρ^{(b - 1)} (k)$ for each $k \in C_{b} \cap C_{b - 1}$. Note that $α + μ_{k} + γ_{b} = α^{*} + μ_{ρ^{(b)} (k)}^{*} + γ_{b}^{*}, \quad k \in C_{b}$ (A.4) and $α + μ_{k} + γ_{b - 1} = α^{*} + μ_{ρ^{(b - 1)} (k)}^{*} + γ_{b - 1}^{*}, \quad k \in C_{b - 1}$ (A.5). Subtracting Equation (A.5) from Equation (A.4) for $k \in C_{b} \cap C_{b - 1}$ gives $γ_{b} - γ_{b - 1} = μ_{ρ^{(b)} (k)}^{*} - μ_{ρ^{(b - 1)} (k)}^{*} + γ_{b}^{*} - γ_{b - 1}^{*}$ (A.6). Because the left-hand side of Equation (A.6) does not depend on k, for any distinct $k_{1}, k_{2} \in C_{b} \cap C_{b - 1}$, Equation (A.6) implies that $μ_{ρ^{(b)} (k_{2})}^{*} - μ_{ρ^{(b)} (k_{1})}^{*} = μ_{ρ^{(b - 1)} (k_{2})}^{*} - μ_{ρ^{(b - 1)} (k_{1})}^{*}$. By Assumption I, this proves $ρ^{(b)} (k) = ρ^{(b - 1)} (k)$ for $k \in C_{b} \cap C_{b - 1}$.

Second, for any $k \in ⋃_{i \neq b} (C_{b} \cap C_{i})$, there exists a $\tilde{b} \neq b$ such that $k \in C_{\tilde{b}}$. Without loss of generality, we assume $\tilde{b} < b$. From the two equations $\begin{matrix} α + μ_{k} + γ_{\tilde{b}} = α^{*} + μ_{ρ^{(\tilde{b})} (k)}^{*} + γ_{\tilde{b}}^{*}, \\ α + μ_{k} + γ_{b} = α^{*} + μ_{ρ^{(b)} (k)}^{*} + γ_{b}^{*}, \end{matrix}$ we have $γ_{b} - γ_{\tilde{b}} = μ_{ρ^{(b)} (k)}^{*} - μ_{ρ^{(\tilde{b})} (k)}^{*} + γ_{b}^{*} - γ_{\tilde{b}}^{*}$ (A.7). Meanwhile, the identity $ρ^{(b)} (k) = ρ^{(b - 1)} (k)$ for $k \in C_{b} \cap C_{b - 1}$, proven in the first step, turns Equation (A.6) into $γ_{b} - γ_{b - 1} = γ_{b}^{*} - γ_{b - 1}^{*}$ for each b ⩾ 2. As a result, $γ_{b} - γ_{\tilde{b}} = \sum_{i = \tilde{b}}^{b - 1} (γ_{i + 1} - γ_{i}) = \sum_{i = \tilde{b}}^{b - 1} (γ_{i + 1}^{*} - γ_{i}^{*}) = γ_{b}^{*} - γ_{\tilde{b}}^{*}$. Consequently, $μ_{ρ^{(b)} (k)}^{*} - μ_{ρ^{(\tilde{b})} (k)}^{*} = 0$ according to Equation (A.7). By Assumption I, $ρ^{(b)} (k) = ρ^{(\tilde{b})} (k)$. Thus, we have proved the identifiability in Theorem 3.
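The telescoping step in this proof can be sketched in code. This is a hedged illustration rather than the authors' procedure: it assumes a single feature, noise-free component means, aligned subtype labels, and that every pair of consecutive batches shares at least one subtype. The function name `batch_effects_by_chaining` and the example values are hypothetical. Each consecutive difference $γ_{b} - γ_{b - 1}$ is read off a shared subtype, and the differences are accumulated starting from $γ_{1} = 0$.

```python
# Illustrative only: batch effects recovered by chaining consecutive batches.

def batch_effects_by_chaining(means, subtypes):
    """means[b][k] = alpha + mu[k] + gamma[b], defined only for k in subtypes[b]."""
    gamma = [0.0]  # gamma[0] = 0: the first batch is the reference batch
    for b in range(1, len(means)):
        shared = sorted(subtypes[b] & subtypes[b - 1])
        if not shared:
            raise ValueError(f"batches {b - 1} and {b} share no subtype")
        k = shared[0]  # any shared subtype yields the same difference
        gamma.append(gamma[-1] + means[b][k] - means[b - 1][k])
    return gamma


alpha = 1.0
mu = {0: 0.0, 1: 2.0, 2: -1.0, 3: 0.5}     # subtype effects, mu[0] = 0
gamma_true = [0.0, 0.5, -0.25]             # batch effects, gamma[0] = 0
subtypes = [{0, 1, 2}, {1, 2, 3}, {2, 3}]  # subtypes present in each batch

means = [{k: alpha + mu[k] + g for k in ks} for g, ks in zip(gamma_true, subtypes)]
gamma_hat = batch_effects_by_chaining(means, subtypes)
```

In this example the third batch shares no subtype with the reference subtype 0, yet its batch effect is still recovered through the chain of overlaps with the second batch.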
