
Bayesian Multivariate Logistic Regression for Superiority and Inferiority Decision-Making under Observable Treatment Heterogeneity


Abstract

The effects of treatments may differ between persons with different characteristics. Addressing such treatment heterogeneity is crucial to investigate whether patients with specific characteristics are likely to benefit from a new treatment. The current paper presents a novel Bayesian method for superiority decision-making in the context of randomized controlled trials with multivariate binary responses and heterogeneous treatment effects. The framework is based on three elements: a) Bayesian multivariate logistic regression analysis with a Pólya-Gamma expansion; b) a transformation procedure to transform obtained regression coefficients to a more intuitive multivariate probability scale (i.e., success probabilities and the differences between them); and c) a compatible decision procedure for treatment comparison with prespecified decision error rates. Procedures for a priori sample size estimation under a non-informative prior distribution are included. A numerical evaluation demonstrated that decisions based on a priori sample size estimation resulted in anticipated error rates among the trial population as well as subpopulations. Further, average and conditional treatment effect parameters could be estimated without bias when the sample was large enough. Illustration with the International Stroke Trial dataset revealed a trend toward heterogeneous effects among stroke patients: something that would have remained undetected had the analyses been limited to average treatment effects.

1. Introduction

The current paper focuses on estimating treatment effects among populations, subpopulations, and individual patients in the context of two-arm randomized controlled trials (RCTs) with multiple (correlated) binary dependent variables. Such RCTs are randomized experiments with subjects being assigned at random to either an experimental or a control group, often having the objectives a) to evaluate whether an experimental treatment is superior or inferior to a control condition; and b) to inform the prescription of treatments to patients in (clinical) practice (Food & Drug Administration, Citation2016). Although RCTs are broadly applicable to experimental research in general, we focus on the health domain and use the word “treatment” to refer to psychological and medical interventions in the broad sense. These interventions include, but are not limited to, behavioral therapies, pharmacological support, and other experimental types of care.

Such RCTs often assess multiple types of (clinical) events (e.g., quitting substance abuse, death), functional measures (e.g., memory decline, ability to walk), or disease symptoms (e.g., fatigue, anxiety) (Food & Drug Administration, Citation2017). Studying multiple dependent variables in RCTs is useful, since multiple dependent variables provide multidimensional insights into the effects of a treatment and since analyzing multiple dependent variables together has the potential to improve the connection between clinical and statistical decision-making. More specifically, multiple effects of the intervention can be combined and weighted in various ways to provide a single statistical decision regarding superiority or inferiority, similar to decisions regarding treatment prescription made by therapists or clinicians (e.g., Pocock et al., Citation1987; O’Brien, Citation1984; Murray et al., Citation2016). Whereas performing multiple univariate analyses on individual dependent variables is a common strategy to deal with data from multiple dependent variables, a single multivariate analysis is often preferable from a statistical point of view (Senn & Bretz, Citation2007; Ristl et al., Citation2019; Food & Drug Administration, Citation2017; Murray et al., Citation2016). Multivariate analysis takes the correlation between dependent variables into account and therefore has the potential to reduce decision errors: Correlations influence the sample sizes required for decision-making with prespecified error rates and lead to under- or overpowered analyses when falsely omitted (Chow et al., Citation2017; Sozu et al., Citation2010; Xiong et al., Citation2005).

RCTs often focus on average treatment effects (ATEs) among the study population when comparing interventions (Thall, Citation2020). Average treatment effects can be sufficiently insightful when the effects of a treatment are relatively homogeneous over the trial population. In this case, patients react relatively similarly to the treatment. However, average effects may give a limited, or even erroneous, impression when the actual effects of a treatment are heterogeneous and thus interact with characteristics of patients. In that case, patients differ in their reactions to the treatment. Taking characteristics of patients into account in the estimation of treatment effects (i.e., estimating conditional average treatment effects; CATEs) can then contribute to a better understanding of the treatment’s potential for an individual patient. Despite efforts to provide statistical methodology to model CATEs (e.g., Wang et al., Citation2015; Yang et al., Citation2021; Jones et al., Citation2011), investigating these effects is not the standard yet: Thall noted that “the great majority of clinical trial designs ignore the possibility of treatment-covariate interactions, and often ignore patient heterogeneity entirely” (Thall, Citation2020, p. 1). This is unfortunate, as addressing conditional effects in the evaluation of treatments is crucial to a) identify how likely a specific patient is to benefit from a treatment; and b) optimize treatment results of individual patients via personalized treatment assignment (Goldberger & Buxton, Citation2013; Hamburg & Collins, Citation2010; Wang et al., Citation2015; Simon, Citation2010).

An example of a trial with multiple dependent variables and potential treatment heterogeneity is the International Stroke Trial (IST; Sandercock et al., Citation2011; International Stroke Trial Collaborative Group, Citation1997). Strokes may have far-reaching implications for the quality of life, as they may be recurring and/or lead to long-term impaired (daily) functioning. The IST investigated whether the short-term and long-term perspective of stroke patients can be improved with anti-thrombotic drug therapy. The average treatment differences in the IST were small, so one might conclude that treatment with one of these drugs was marginally effective. However, these overall findings were based on the assumption that specific characteristics of patients (e.g., sex or age) and/or disease (e.g., type of stroke or functional status after stroke) did not interact with the treatment to produce different effects for different patients. Average treatment effects could, for example, not reveal whether older patients have better prospects in terms of short-term damage risk and/or long-term recovery potential than younger patients. Clearly, hypothetical heterogeneous effects such as these would have clinically and psychologically relevant implications and would advocate the development of more personalized treatment policies.

While multivariate treatment effects for patients with specific characteristics are theoretically relevant for many contemporary RCTs contributing to the personalization of treatments, decision-making under treatment heterogeneity in the multivariate context is considerably more complex compared to the non-heterogeneous and/or univariate setting. Generalizations to the heterogeneous and multivariate context are subject to assumptions that need to be carefully evaluated in light of the research problem at hand. First, the multivariate setting demands an analysis method that incorporates the correlation between dependent variables (i.e., a multivariate analysis method) to obtain accurate decision error rates (e.g., Sozu et al., Citation2010, Citation2016; Kavelaars et al., Citation2020). Ignoring or misspecifying a non-zero correlation can result in over- or underestimation of the required sample size and thus affects the statistical power of the analysis. For accurate inference regarding conditional average treatment effects, the analysis should not only include the overall correlation among the trial population, but should also be flexible enough to deal with correlations that differ over subpopulations. The latter is not evident in existing multivariate analysis methods for binary dependent variables: Some methods impose the marginal correlation structure of the trial population on subpopulations (e.g., multivariate probit models by Chib (Citation1995) or Rossi et al. (Citation2005) and multivariate logit models by Malik and Abraham (Citation1973) and O’Brien and Dunson (Citation2004)). Second, the interpretation of treatment effects can be complex in multivariate non-linear models. 
Creating insight into so-called marginal effects (i.e., treatment effects on the individual dependent variables) is recommended in treatment comparison, requiring any multivariate method to return interpretable univariate effects (Food & Drug Administration, Citation2017; O’Brien & Dunson, Citation2004). Some existing multivariate models lack insight into marginal distributions (e.g., Malik & Abraham, Citation1973). Third, some multivariate methods estimate a single regression parameter to capture the relation between a covariate and all dependent variables (e.g., O’Brien & Dunson, Citation2004; Rossi et al., Citation2005). The latter assumes that all dependent variables vary identically over the full support of the covariate; in other words, that all relations between the covariate and the outcome variables have the same size and direction. Clearly, such an assumption may be too strict to hold in practice.

In order to deal with the complexity of heterogeneous, multivariate treatment effects, we build upon an existing Bayesian multivariate Bernoulli framework for superiority decision-making proposed by Kavelaars et al. (Citation2020). The existing procedure consists of three major components: a) a multivariate analysis model to estimate unknown parameters; b) a transformation procedure to interpret treatment effects on the (more intuitive) probability scale; and c) a compatible decision procedure to make inferences regarding treatment superiority with prespecified error rates. The analysis procedure has advantages over several other approaches, as it relies on a multinomial distribution and therefore has the flexibility to model univariate effects and correlations between dependent variables. The transformation procedure facilitates the interpretation of treatment comparison: marginal (i.e., univariate) probabilities, multivariate probabilities, and differences between (multivariate) probabilities can be used in inference as well. The decision procedure is suitable for Bayesian inference and can be flexibly applied with several decision rules for multiple dependent variables. Noteworthy is a decision rule with a compensatory mechanism, which weighs dependent variables by their importance and can balance positive and negative treatment effects. With this decision procedure, decisions regarding treatment superiority can be made with targeted decision error rates (i.e., Type I and Type II errors) and a priori computed sample sizes.

Kavelaars et al. (Citation2020) proposed a multivariate Bernoulli model for multivariate binary outcomes to estimate average treatment effects and to make decisions based on multivariate treatment effects. In the current paper we propose a more flexible modeling framework for multivariate Bernoulli outcomes using Bayesian multivariate logistic regression models. This extension allows us to model and estimate multivariate treatment effects for different (sub)populations based on available covariate information. Moreover, to make decisions about multivariate treatment effects for these different subpopulations, we extend the decision procedure of Kavelaars et al. (Citation2020) to the new multivariate logistic regression model. Additionally, sample size recommendations are provided for estimation and decision-making under this framework.

Note that the proposed multivariate modeling framework aims to estimate heterogeneous multivariate treatment effects that are explained by observed covariate information and to make decisions about treatment superiority. This aim differs from that of mixture modeling, which seeks to capture unobserved (treatment) heterogeneity using latent variables (either discrete or continuous). Mixture models use response data to cluster respondents based on their patterns of outcome data (e.g., patterns of symptoms), where each cluster has an individual distribution that forms a constituent of the mixture (McLachlan et al., Citation2019). The proposed regression model does not include latent variables (either discrete or continuous) to capture unobserved heterogeneity. Instead, multivariate (logistic) regression uses observed covariate information to define patient groups of interest, often on theoretical (such as accepted cutoff values for high and low blood pressure) or statistical (such as those respondents with scores more extreme than one standard deviation below or above the mean) grounds. Subgroups are thus bounded by criteria specified by the researcher, rather than by response patterns in the data.

The paper is organized as follows. In the next section, we introduce the decision framework, including the multivariate logistic regression model to obtain a sample from the multivariate posterior distribution of regression coefficients, a transformation procedure to find posterior treatment differences, and a decision procedure to draw conclusions regarding treatment superiority and inferiority. The section on capturing heterogeneity explains how the framework can be applied to different patient populations. We evaluate frequentist operating characteristics of the framework via simulation in the numerical evaluation section. Next, we illustrate the methods with data from the International Stroke Trial and conclude the paper with a discussion.

2. Decision framework

2.1. Multivariate logistic regression

Response yik is the binary response of subject i on outcome variable k ∈ {1,…,K}, where yik ∈ {0,1}, with 0 = failure and 1 = success. Vector yi = (yi1,…,yiK) is the multivariate (or joint) binary response vector of subject i on K dependent variables and has configuration Hq·, which is one of the Q = 2^K possible response combinations of length K given in the qth row of matrix H:
\[
H = \begin{bmatrix}
1 & 1 & \cdots & 1 & 1 \\
1 & 1 & \cdots & 1 & 0 \\
\vdots & & & & \vdots \\
0 & 0 & \cdots & 0 & 1 \\
0 & 0 & \cdots & 0 & 0
\end{bmatrix} \tag{1}
\]

The probability of yi can be expressed in two meaningful and related ways. First, θi = (θi1,…,θiK) denotes the vector of K-variate success probabilities on individual outcomes 1,…,K, where θik = p(yik = 1). Second, ϕi = (ϕi1,…,ϕiQ) denotes the vector of Q-variate joint response probabilities, where ϕiq = p(yi = Hq·); the elements of ϕi sum to unity. The joint response of subject i can be conditioned on covariates in vector xi = (xi1,…,xiP). In this case, the probabilities of response vector yi|xi are expressed as functions of xi, namely ϕi(xi) and θi(xi).
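As a concrete illustration of the configuration matrix H, the Q = 2^K response combinations can be enumerated programmatically. This is a minimal sketch, not from the paper; the function name is ours, and the row order matches the convention of listing all-success first and all-failure last.

```python
# Sketch (illustrative, not the authors' code): enumerate the Q = 2^K
# joint response configurations of matrix H for K binary outcomes.
from itertools import product

import numpy as np

def response_matrix(K):
    """Return the Q = 2^K joint response configurations as rows of H."""
    return np.array(list(product([1, 0], repeat=K)))

print(response_matrix(2))
# rows: [1 1], [1 0], [0 1], [0 0] -- the four configurations for K = 2
```

For K = 2 this reproduces the four categories {11}, {10}, {01}, {00} used in the IST example below.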

Joint response probability ϕiq(xi) maps the dependency of joint response probabilities on covariates xi via a multinomial logistic function:
\[
\phi_{iq}(x_i) = \frac{\exp[\psi_{iq}(x_i)]}{\sum_{r=1}^{Q-1} \exp[\psi_{ir}(x_i)] + 1} \tag{2}
\]
for response categories q = 1,…,Q−1. In Equation 2, ψiq(xi) reflects the linear predictor of response category q and subject i:
\[
\psi_{iq}(x_i) = \beta_{0q} + \beta_{1q} x_{i1} + \dots + \beta_{Pq} x_{iP}. \tag{3}
\]

Here, xip can be a treatment indicator, a patient characteristic, or an interaction between these. Vector βq = (β0q, β1q,…,βPq) is the vector of regression coefficients of response category q. To ensure identifiability, all regression coefficients of response category Q are fixed at zero, i.e., βQ = 0.
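The link in Equations 2 and 3, together with the reference-category constraint βQ = 0, can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation; the function and argument names are ours.

```python
# Sketch of Equations 2-3 with the reference-category constraint beta_Q = 0.
import numpy as np

def joint_probs(beta, x):
    """beta: (Q-1, P+1) coefficients; x: (P,) covariate vector.
    Returns the Q joint response probabilities phi_i(x_i)."""
    psi = beta @ np.concatenate(([1.0], x))   # linear predictors (Eq. 3)
    e = np.exp(psi)
    denom = e.sum() + 1.0                     # '+ 1' is category Q with psi = 0
    return np.concatenate((e / denom, [1.0 / denom]))

# With all coefficients zero, every category is equally likely:
print(joint_probs(np.zeros((3, 3)), np.array([1.0, 0.5])))  # [0.25 0.25 0.25 0.25]
```

Note that the "+ 1" in the denominator is the contribution of the fixed reference category Q, whose linear predictor is identically zero.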

The likelihood of the response data follows from taking the product of the individual joint response probabilities from Equation 2 over the n subjects and Q response categories:
\[
l(y \mid \beta, x) = \prod_{i=1}^{n} \prod_{q=1}^{Q-1}
\left( \frac{\exp[\psi_{iq}(x_i)]}{\sum_{r=1}^{Q-1} \exp[\psi_{ir}(x_i)] + 1} \right)^{I(y_i = H_{q\cdot})}
\left( \frac{1}{\sum_{r=1}^{Q-1} \exp[\psi_{ir}(x_i)] + 1} \right)^{I(y_i = H_{Q\cdot})}. \tag{4}
\]

Bayesian analysis is done via the posterior distribution, which is given by
\[
p(\beta_q \mid y) \propto p(y \mid \beta_q)\, p(\beta_q), \tag{5}
\]
where p(βq) reflects the prior distribution of the unknown parameters before observing the data. Posterior sampling can be done with a Gibbs sampling algorithm based on a Pólya-Gamma expansion (Polson et al., Citation2013). Computational details of this procedure can be found in Appendix A.

2.2. Transformation to treatment differences

In contrast to the coefficients of several other regression analyses, the obtained multinomial regression coefficients have no straightforward interpretation. We aim to make the posterior sample of regression coefficients interpretable in terms of a treatment difference, which is defined as the (multivariate) difference between the success probabilities of two treatments. To this end, we execute the following multistep procedure, using a fictitious setup of the IST trial as a running example.

Suppose we are interested in the effect of combined drug therapy (Heparin plus Aspirin; TH+A) vs. single drug therapy (Aspirin only; TA) on short-term recurrent stroke (ystrk) and long-term dependency (ydep). There is a total of Q = 4 response categories: {ystrk=1, ydep=1}, {ystrk=1, ydep=0}, {ystrk=0, ydep=1}, {ystrk=0, ydep=0}, which we refer to as {11}, {10}, {01}, and {00} respectively. The treatments are blood thinning agents and may thus interact with the patient’s blood pressure. Therefore, we include systolic blood pressure at the time of randomization as a covariate, so that we can estimate conditional effects for patients with different values of blood pressure, resulting in the following model:
\[
\psi_{iq}(x_i) = \beta_{0q} + \beta_{1q} T_i + \beta_{2q} bp_i + \beta_{3q} bp_i T_i, \tag{6}
\]
where xi = (Ti, bpi, bpiTi). The transformation procedure is then as follows:

  1. Regression coefficients β to joint response probabilities ϕT(x): In the first step, the posterior sample of regression coefficients β is transformed to a treatment effect in terms of joint response probabilities ϕTi(xi) for each treatment T ∈ {0,1}. Linear predictor ψiq(xi) is transformed to individual joint response probability ϕiq(xi) via the multinomial logistic function in Equation 2:
\[
\phi_{iq}(x_i) = \frac{\exp[\psi_{iq}(x_i)]}{\sum_{r=1}^{Q-1} \exp[\psi_{ir}(x_i)] + 1} \tag{2 revisited}
\]

    For example, the probability that patient i in the IST does not experience a new stroke and is dependent after six months can be expressed as:
\[
\phi_{Ti3}(x_i) = p(y_i(x_i) = \{01\}) = \frac{\exp[\psi_{i3}(x_i)]}{\sum_{r=1}^{Q-1} \exp[\psi_{ir}(x_i)] + 1}. \tag{7}
\]
This probability can be computed for the other joint response combinations as well. Note that we are in fact interested in joint response probability ϕT(x), which reflects a treatment effect among a (sub)population defined by x. This notation is more general than the joint response probability of an individual patient with covariates xi. The population can be reflected by an individual patient (e.g., with a systolic blood pressure of 100) in some situations, while other cases target the entire study population (e.g., no restriction on systolic blood pressure) or a subpopulation of interest (e.g., with a systolic blood pressure above 150). These variations have slightly different computational procedures, which we discuss in more detail in Section 3.

  2. Joint response probabilities ϕT(x) to multivariate success probabilities θT(x): The next step in the transformation involves the conversion from joint response probabilities ϕT(x) to multivariate success probabilities of the individual dependent variables θT(x). Especially when the number of dependent variables increases, success probabilities are more straightforward to interpret than joint response probabilities. The relation between the two quantities is additive: success probability θTk of outcome k under treatment T equals the sum of a selection of elements of ϕT, denoted by matrix Uk:
\[
\theta_{Tk}(x) = \sum_{q=1}^{Q} \phi_{Tq}(x)\, I(H_{q\cdot} \in U_k). \tag{8}
\]
Selection Uk consists of the 2^(K−1) rows of H that have their kth element equal to 1. Concretely, for the two dependent variables from the IST these matrices are as follows, where we drop the dependency on x for notational simplicity:
\[
H = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad
U_{strk} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}, \quad
U_{dep} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}.
\]

    Hence, the multivariate success probabilities in θT = (θTstrk, θTdep) consist of univariate success probabilities:
\[
\begin{aligned}
\theta_{Tstrk} &= p(y_i(x_i) = \{11\}) + p(y_i(x_i) = \{10\}) = \phi_{T1} + \phi_{T2} \\
\theta_{Tdep} &= p(y_i(x_i) = \{11\}) + p(y_i(x_i) = \{01\}) = \phi_{T1} + \phi_{T3}.
\end{aligned} \tag{9}
\]
The correlation between these dependent variables is captured in joint response probabilities ϕT(x) and automatically taken into account in further transformations (Olkin & Trikalinos, Citation2015; Dai et al., Citation2013).

  3. Success probabilities θT(x) to treatment differences δ(x): The treatment difference on outcome k, δk(x), is defined as the difference between the success probabilities of the two treatments on outcome k, such that:
\[
\delta_k(x) = \theta_{1k}(x) - \theta_{0k}(x). \tag{10}
\]

The K-variate treatment difference is then δ(x) = (δ1(x),…,δK(x)).

Multivariate treatment difference δ = (δstrk, δdep) in the IST is a vector of the univariate treatment differences:
\[
\begin{aligned}
\delta_{strk} &= \theta_{H+A,strk} - \theta_{A,strk} \\
\delta_{dep} &= \theta_{H+A,dep} - \theta_{A,dep}.
\end{aligned} \tag{11}
\]

Applying the three above-mentioned steps to each draw of the posterior sample of β results in a posterior sample of multivariate treatment difference δ(x). This sample provides estimates that can be used for prediction, where various posterior summaries (e.g., the mean or a highest posterior density interval) can be used to condense the sample into a point or interval estimate. Moreover, the sample can be used for statistical inference to generalize the conclusion to the specified (sub)population, as outlined in the next subsection.
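The three transformation steps can be sketched for the bivariate IST example (K = 2, categories ordered {11}, {10}, {01}, {00}). This is an illustrative sketch applied to a single posterior draw, not the authors' implementation; all names are ours.

```python
# Sketch of transformation steps 1-3 for one posterior draw of beta.
import numpy as np

H = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])  # response configurations

def joint_probs(beta, x):
    """Step 1 (Eq. 2): regression draw -> joint response probabilities."""
    psi = beta @ np.concatenate(([1.0], x))
    e = np.exp(psi)
    return np.concatenate((e, [1.0])) / (e.sum() + 1.0)

def success_probs(phi):
    """Step 2 (Eq. 8): theta_k sums phi_q over configurations with a 1
    in position k, i.e. theta = H' phi."""
    return H.T @ phi

def treatment_difference(beta, x_treated, x_control):
    """Step 3 (Eq. 10): difference of success probabilities."""
    return (success_probs(joint_probs(beta, x_treated))
            - success_probs(joint_probs(beta, x_control)))

# E.g. phi = (0.4, 0.1, 0.3, 0.2) gives theta_strk = 0.5, theta_dep = 0.7:
print(success_probs(np.array([0.4, 0.1, 0.3, 0.2])))  # [0.5 0.7]
```

Repeating `treatment_difference` over all posterior draws of β yields the posterior sample of δ(x) described above.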

2.3. Posterior decision-making

Decisions rely on estimated treatment effects, such as differences between success probabilities, and their uncertainties. More formally, multivariate treatment difference δ has complete parameter space S = [−1,1]^K, which is divided into a rejection region SR and a non-rejection region SN. Rejection region SR reflects the part of the parameter space that indicates the treatment difference of interest, where we would conclude that the treatments differ. The non-rejection region SN refers to the part of the parameter space that would not be considered a (relevant) treatment difference. Rejection regions depend on the type of decision and can be composed of multiple subregions if desired (Van Ravenzwaaij et al., Citation2019). We consider the following three (commonly used) decision types:

  1. superiority with region SR = SS, where the treatment is better;

  2. inferiority with region SR = SI, where the treatment is worse;

  3. two-sided with rejection region SR = SS ∪ SI, where the treatment can be either better or worse.

We consider the evidence sufficiently strong to conclude superiority and/or inferiority when the posterior probability that treatment difference δ(x) lies in the rejection region exceeds a prespecified decision threshold, pcut:
\[
p(\delta(x) \in S_R \mid y) > p_{cut}. \tag{12}
\]

When the functional form of the posterior distribution is unknown, the rejection probability can be estimated from an MCMC sample of L draws from the posterior distribution of δ(x). Equation 12 is then applied in practice as:
\[
\frac{1}{L} \sum_{l=1}^{L} I(\delta^{(l)}(x) \in S_R \mid y) > p_{cut}. \tag{13}
\]
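Equation 13 amounts to counting the posterior draws that fall in the rejection region. A minimal sketch (illustrative names, not the paper's code; the example region is a hypothetical right-sided superiority region on the first outcome):

```python
# Sketch of Equation 13: approximate the rejection probability from L
# posterior draws of delta(x) and compare it with p_cut.
import numpy as np

def decide(delta_draws, in_rejection_region, p_cut=0.95):
    """delta_draws: (L, K) array; in_rejection_region maps one draw to bool.
    Returns (decision, estimated rejection probability)."""
    prob = np.mean([in_rejection_region(d) for d in delta_draws])
    return bool(prob > p_cut), float(prob)

# Example: right-sided superiority on the first outcome, delta_1 > 0.
draws = np.array([[0.10, 0.02]] * 96 + [[-0.10, 0.02]] * 4)
print(decide(draws, lambda d: d[0] > 0))  # (True, 0.96)
```

The function `in_rejection_region` can encode any of the rules below, so the same machinery serves the Any, All, and Compensatory rules.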

In a situation with multiple dependent variables, superiority and inferiority can be defined in multiple ways, resulting in different rejection regions (e.g., Pocock et al., Citation1987; Pocock, Citation1997; O’Brien, Citation1984; Prentice, Citation1997; Tang et al., Citation1993; Zhao et al., Citation2007). Although not intended as an exhaustive overview, we list three possible rules and graphically present their rejection regions in . Two of these rules (which we refer to as the “Any” and “All” rules) are presented as part of the regulatory guideline regarding multiple endpoints by the Food and Drug Administration (Citation2017) and have been extensively discussed in the literature (e.g., Chuang-Stein et al., Citation2006; Sozu et al., Citation2010, Citation2016; Xiong et al., Citation2005). The third rule (“Compensatory”) is a relatively unknown, flexible alternative that weighs benefits and risks of treatments by their (clinical) relevance (Murray et al., Citation2016; Kavelaars et al., Citation2020). A more elaborate comparison of these rules can be found in Kavelaars et al. (Citation2020).

  1. Any rule: The Any rule results in superiority or inferiority when the difference between success probabilities is larger or smaller than zero, respectively, on at least one of the dependent variables (Sozu et al., Citation2016). The superiority and inferiority spaces are defined as:
\[
\begin{aligned}
S_S^{Any} &= \left\{ \delta(x) \mid \max_{1 \le k \le K} \delta_k(x) > 0 \right\} \\
S_I^{Any} &= \left\{ \delta(x) \mid \min_{1 \le k \le K} \delta_k(x) < 0 \right\}.
\end{aligned} \tag{14}
\]

  2. All rule: The All rule results in superiority or inferiority when the difference between success probabilities is larger or smaller than zero, respectively, on all of the dependent variables (Sozu et al., Citation2010). The superiority and inferiority spaces are defined as:
\[
\begin{aligned}
S_S^{All} &= \left\{ \delta(x) \mid \min_{1 \le k \le K} \delta_k(x) > 0 \right\} \\
S_I^{All} &= \left\{ \delta(x) \mid \max_{1 \le k \le K} \delta_k(x) < 0 \right\}.
\end{aligned} \tag{15}
\]

  3. Compensatory rule: The Compensatory rule results in superiority or inferiority when the weighted difference between success probabilities is larger or smaller than zero, respectively. The superiority and inferiority spaces are defined as:
\[
\begin{aligned}
S_S^{Comp}(w) &= \{ \delta(x) \mid \delta(w,x) > 0 \} \\
S_I^{Comp}(w) &= \{ \delta(x) \mid \delta(w,x) < 0 \},
\end{aligned} \tag{16}
\]

where w = (w1,…,wK) reflects the weights of the K treatment differences, δ(w,x) = Σ_{k=1}^{K} wkδk(x), with 0 ≤ wk ≤ 1 and Σ_{k=1}^{K} wk = 1 (Kavelaars et al., Citation2020).
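The three superiority regions translate directly into code. A sketch (illustrative names; the δ vector and weights below are made-up values) evaluating Equations 14-16 for a single draw of δ:

```python
# Sketch of the three superiority regions (Equations 14-16) for one draw.
import numpy as np

def superior_any(delta):
    return bool(np.max(delta) > 0)            # Any rule (Eq. 14)

def superior_all(delta):
    return bool(np.min(delta) > 0)            # All rule (Eq. 15)

def superior_comp(delta, w):
    return bool(np.dot(w, delta) > 0)         # Compensatory rule (Eq. 16)

delta = np.array([0.20, -0.05])               # better on one outcome only
print(superior_any(delta))                    # True
print(superior_all(delta))                    # False
print(superior_comp(delta, [0.5, 0.5]))       # True: 0.5*0.20 - 0.5*0.05 > 0
```

The example illustrates the compensatory mechanism: a small negative effect on one outcome is offset by a larger positive effect on the other, so the Compensatory rule agrees with the Any rule here while the All rule does not.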

2.4. Sample size computations

Decisions resulting from analysis with the Bayesian multivariate logistic regression framework are based on a posterior probability. In the absence of prior information, the Bayesian posterior probability has a direct relation with the frequentist p-value: the Bayesian posterior probability equals 1 − p and behaves according to the well-known relationship between effect size, sample size, and decision error rates (Marsman & Wagenmakers, Citation2016). This allows for control of decision error rates (Type I and Type II errors) via a priori computed sample sizes. Methods to compute required sample sizes are available for variables that follow a multivariate Bernoulli distribution and are eligible for large-sample approximation by a (multivariate) normally distributed latent variable (Sozu et al., Citation2016, Citation2010; Chow et al., Citation2017). These procedures have been shown to accurately control Type I error rate α and Type II error rate β in a Bayesian multivariate Bernoulli-Dirichlet model for multivariate response data with a non-informative prior distribution (Kavelaars et al., Citation2020). Each of the decision rules presented in Subsection 2.3 has an individual procedure to compute sample sizes, as discussed below. These equations provide insight into the required number of observations in the absence of prior information and into the influence of the correlation on the sample sizes needed to obtain targeted decision error rates. For notational simplicity, we discard the dependence on x in the remainder of this subsection.

2.4.1. All and Any rules

Sample size computations for the All and Any rules were formulated by Sozu et al. (Citation2010) and Sozu et al. (Citation2016) respectively and rely on the assumption of a multivariate normal latent variable. The power, 1 − β, can be expressed in terms of a cumulative K-variate normal distribution ΨK with mean 0 and correlation matrix Σ (Sozu et al., Citation2016):
\[
1 - \beta = \Psi_K(c_1, \dots, c_K). \tag{17}
\]

In Equation 17, ck for outcome k is defined by the decision rule of interest. Further, the off-diagonal elements of Σ denote (estimated) pairwise correlations between dependent variables.

For the Any rule,
\[
c_k = z_{(1-\alpha/K)} - \frac{\theta_{1k} - \theta_{0k}}{\sqrt{\left[\theta_{1k}(1-\theta_{1k}) + \theta_{0k}(1-\theta_{0k})\right]/n}}. \tag{18}
\]

For the All rule,
\[
c_k = -z_{(1-\alpha)} + \frac{\theta_{1k} - \theta_{0k}}{\sqrt{\left[\theta_{1k}(1-\theta_{1k}) + \theta_{0k}(1-\theta_{0k})\right]/n}}. \tag{19}
\]
In Equations 18 and 19, n is the sample size per treatment and z(.) refers to the (1 − α/K) or (1 − α) quantile of the univariate standard normal distribution. Further, θ1k and θ0k refer to the success probabilities on outcome variable k of treatments 1 and 0 respectively.

Since the cumulative multivariate normal distribution does not have a closed form, the sample size that satisfies the targeted decision error rates can be found via the following iterative procedure proposed by Sozu et al. (Citation2010):

  1. Plug in estimates of θ1k and θ0k in Equation 18 or 19.

  2. Plug in a starting value for n in Equation 18 or 19 and calculate the power via Equation 17.

  3. Repeat step 2 with gradually increasing n until the power exceeds the desired level.

  4. Select n as the sample size per treatment group.
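The steps above can be sketched for the All rule, with scipy's K-variate normal CDF standing in for ΨK. This is a sketch under the stated normal approximation (Equations 17 and 19); the function names and example probabilities are illustrative, not from the paper.

```python
# Sketch of the iterative sample size search for the All rule.
import numpy as np
from scipy.stats import multivariate_normal, norm

def power_all(n, theta1, theta0, corr, alpha=0.05):
    """Approximate power of the All rule (Eqs. 17 and 19) at per-group n."""
    theta1, theta0 = np.asarray(theta1), np.asarray(theta0)
    var = theta1 * (1 - theta1) + theta0 * (1 - theta0)
    c = -norm.ppf(1 - alpha) + (theta1 - theta0) / np.sqrt(var / n)  # Eq. 19
    return multivariate_normal(mean=np.zeros(len(c)), cov=corr).cdf(c)  # Eq. 17

def sample_size_all(theta1, theta0, corr, alpha=0.05, target=0.80):
    n = 2                                            # step 2: starting value
    while power_all(n, theta1, theta0, corr, alpha) < target:
        n += 1                                       # step 3: increase n
    return n                                         # step 4: per-group n

n = sample_size_all([0.70, 0.60], [0.50, 0.50], [[1.0, 0.2], [0.2, 1.0]])
```

Increasing n by one per iteration mirrors the procedure literally; in practice a coarser grid or bisection over n would reach the same sample size faster.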

2.4.2. Compensatory rule

Sample sizes for the Compensatory rule can be computed using standard methodology for large-sample tests of two binomial proportions (Chow et al., Citation2017, Chapter 4). Plugging in estimates of the weighted success probabilities per treatment T, θTw, results in:
\[
n = \left[\theta_{1w}(1-\theta_{1w}) + \theta_{0w}(1-\theta_{0w})\right] \left[\frac{z_{1-\alpha} + z_{1-\beta}}{\theta_{1w} - \theta_{0w}}\right]^2, \tag{20}
\]
where θTw = Σ_{k=1}^{K} wkθTk, and z1−β is the (1 − β) quantile of the univariate standard normal distribution.
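Equation 20 translates directly into code. A sketch (the success probabilities and equal weights below are made-up illustration values, not results from the paper):

```python
# Sketch of Equation 20; NormalDist supplies the standard normal quantiles.
import math
from statistics import NormalDist

def sample_size_comp(theta1, theta0, w, alpha=0.05, beta=0.20):
    """Per-group sample size for the Compensatory rule (Eq. 20)."""
    t1 = sum(wk * tk for wk, tk in zip(w, theta1))      # theta_1w
    t0 = sum(wk * tk for wk, tk in zip(w, theta0))      # theta_0w
    z = NormalDist().inv_cdf
    variance = t1 * (1 - t1) + t0 * (1 - t0)
    return math.ceil(variance * ((z(1 - alpha) + z(1 - beta)) / (t1 - t0)) ** 2)

# Equal weights, theta_1 = (0.70, 0.60), theta_0 = (0.50, 0.50):
n = sample_size_comp([0.70, 0.60], [0.50, 0.50], [0.5, 0.5])
```

Because the weighted probabilities collapse the multivariate problem to a single proportion per arm, no multivariate normal integration is needed here, unlike for the All and Any rules.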

2.4.3. Correlation, sample size, and statistical power

We illustrate the relation between the sample size, the statistical power and the correlation between dependent variables with an example. We computed required sample sizes to obtain 80% statistical power for the following bivariate (K=2) and trivariate (K=3) outcomes, where we used different correlations (ρθk,θl) and multivariate treatment differences (δ).

  1. S2: K = 2, ρθk,θl ∈ {−0.20, 0.00, 0.20}, δ = (0.20, 0.10)

  2. S3: K = 3, ρθk,θl ∈ {−0.20, 0.00, 0.20}, δ = (0.20, 0.10, 0.20)

  3. L2: K = 2, ρθk,θl ∈ {−0.40, 0.00, 0.40}, δ = (0.30, 0.20)

  4. L3: K = 3, ρθk,θl ∈ {−0.40, 0.00, 0.40}, δ = (0.30, 0.20, 0.30)

Scenarios S2 and S3 had a smaller multivariate treatment difference δ and weaker non-zero correlations than scenarios L2 and L3.

shows the required sample sizes for these scenarios as well as the anticipated statistical power for right-sided superiority decision-making under two scenarios:

  1. When sample size computations are based on the true multivariate treatment difference δ and the true correlation between dependent variables ρθk,θl. This scenario aims to highlight that statistical power can be targeted when sample size computations follow the true data-generating mechanism.

  2. When sample size computations are based on the true multivariate treatment difference δ and uncorrelated dependent variables (i.e., ρθk,θl = 0). This scenario provides insight into anticipated error rates when the correlation is not taken into account in sample size computations. This situation is equivalent to performing multiple univariate analyses on correlated dependent variables.

These probabilities are computed by plugging in true treatment differences and correlations, while using either the required sample size (scenario 1) or the sample size for uncorrelated data (scenario 2) in Equations 17-20.

This illustration provides five takeaways. First, larger effect sizes (L2 and L3) result in smaller required samples than smaller effect sizes (S2 and S3). Second, adding a dependent variable increases the required sample size: required sample sizes are larger for a three-dimensional outcome (S3 and L3) than for a two-dimensional outcome (S2 and L2). Third, the required sample size depends on the correlation between dependent variables. Compared to uncorrelated dependent variables, the Any and Compensatory rules require fewer observations when dependent variables are negatively correlated, whereas positively correlated dependent variables require more observations. Consequently, when sample size computations do not take non-zero correlations into account, statistical power will be larger or smaller than targeted, respectively. Fourth, the relation between correlation and required sample size differs between decision rules. Compared to the Any and Compensatory rules, the All rule shows the opposite relation between the direction of the correlation and the required sample size: here, positively correlated dependent variables require fewer observations than uncorrelated or negatively correlated dependent variables. Moreover, the All rule appears less sensitive to the correlation than the other rules: sample sizes are not very different and statistical power under independence is still close to the targeted 0.80. Fifth, the effect of the correlation on required sample size and statistical power is larger in the scenarios with stronger non-zero correlations (L2 and L3). In these scenarios, the discrepancy between the targeted power of 0.80 and the actual power is larger for non-zero correlations. Further, the differences between the presented sample sizes for negatively correlated, uncorrelated, and positively correlated dependent variables are larger than in the scenarios with weaker correlations (S2 and S3).

These takeaways are in line with detailed discussions in Sozu et al. (Citation2010, Citation2016); Food and Drug Administration (Citation2017); Kavelaars et al. (Citation2020).

3. Estimating conditional average treatment effects

In the proposed framework, treatment heterogeneity can be captured by joint response probabilities that reflect conditional average treatment effects and thus depend on prespecified characteristics of a subpopulation of interest. We describe two ways to represent subpopulations: by fixed covariate values or by a prespecified interval of the covariate distribution(s). Both representations have their own applications. Fixed values of covariates may be relevant when we wish to investigate treatment effects for individual patients or for patient populations that can be accurately represented by a single value of the covariate (such as a mean or a level of a discrete variable). Intervals of covariate distributions may be particularly sensible when multiple consecutive covariate values are sufficiently exchangeable to estimate a marginal treatment effect among a population specified by this range. Although such intervals can be specified for discrete covariates as well, their use is especially reasonable with continuous covariates, as intervals are inherently consistent with the idea of continuity.

We will discuss procedures to estimate conditional average treatment effects based on fixed values and based on intervals in more detail in the remainder of this subsection. In these discussions, we use a linear predictor ψiq(x) (cf. Equation 3) that distinguishes between treatments via a treatment indicator and allows for interaction between the treatment and a covariate. For such a model that includes a single population characteristic z, x=(z,T,zT) and ψTq(x) is defined as: (21) ψTq(x) = β0q + β1qT + β2qz + β3qzT.

3.1. Fixed values of covariate

For a patient population with fixed values of patient covariates, a posterior sample of joint response probabilities ϕT(x) can be found by plugging a vector of fixed covariate values x into linear predictor ψT(l)(x). Subsequently applying the multinomial logistic link function in Equation 2 to each ψT(l)(x) results in joint response probability ϕT(l)(x) for treatment T. Applying these steps to each posterior draw (l) of regression coefficients β(l) results in a sample of posterior joint response probabilities. The procedure is presented in Algorithm 1 in Appendix C.
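
As a minimal sketch of this plug-in step (written in Python rather than the R implementation used in the paper; the function name, the toy coefficient draws, and the number of response categories are our own illustrative choices, not taken from the paper):

```python
import math

def joint_probs(beta_draw, z, T):
    """Map one posterior draw of coefficients to joint response probabilities.

    beta_draw holds one (b0, b1, b2, b3) tuple per non-reference category,
    matching the linear predictor psi = b0 + b1*T + b2*z + b3*z*T.
    """
    psi = [b0 + b1 * T + b2 * z + b3 * z * T for (b0, b1, b2, b3) in beta_draw]
    psi.append(0.0)                      # reference category
    expn = [math.exp(p) for p in psi]
    total = sum(expn)
    return [e / total for e in expn]     # multinomial logistic (softmax) link

# Toy posterior: two draws, Q = 3 joint response categories (2 non-reference).
posterior = [
    [(0.2, 0.5, -0.1, 0.3), (-0.4, 0.1, 0.2, 0.0)],
    [(0.1, 0.6, -0.2, 0.2), (-0.3, 0.2, 0.1, 0.1)],
]
# Fixed covariate value z = 1 under the experimental treatment (T = 1):
phi_draws = [joint_probs(b, z=1.0, T=1) for b in posterior]
```

Each posterior draw of regression coefficients yields one draw of the joint response probabilities, so the collection `phi_draws` approximates the posterior distribution of ϕT(x) for the fixed covariate value.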

3.2. Marginalization over a distribution of covariates

When the population is characterized by a range of covariates, the treatment effect can be marginalized over the interval under consideration, based on available information regarding the distribution of the covariate.

A sample of covariate data can be used as input for marginalization. Empirical marginalization involves repeating the fixed-values procedure for each subject in the sample to obtain a sample of joint response probabilities for each posterior draw (l). Averaging the resulting sample of joint response probabilities per treatment results in a marginal joint response probability ϕT(l)(x) for draw (l). The procedure is presented in Algorithm 2 in Appendix C. Empirical marginalization is computationally efficient for patient populations defined by intervals of more than one continuous covariate. Note, however, that the procedure is prone to sampling variability in x and that estimation might depend on the availability of cases with the selected covariate values. Increasing the specificity of subpopulations – often resulting from a higher number of included covariates and/or a limited interval size – will reduce the number of available observations eligible for inclusion.Footnote1
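
A minimal Python sketch of empirical marginalization (function names, coefficient values, and covariate values are invented for illustration; the paper's actual procedure is Algorithm 2 in Appendix C):

```python
import math

def joint_probs(beta_draw, z, T):
    """One posterior draw of coefficients -> joint response probabilities."""
    psi = [b0 + b1 * T + b2 * z + b3 * z * T for (b0, b1, b2, b3) in beta_draw]
    psi.append(0.0)                      # reference category
    expn = [math.exp(p) for p in psi]
    total = sum(expn)
    return [e / total for e in expn]

def marginal_probs(beta_draw, z_sample, T):
    """Empirically marginalize: average per-subject probabilities over z."""
    per_subject = [joint_probs(beta_draw, z, T) for z in z_sample]
    n = len(per_subject)
    return [sum(column) / n for column in zip(*per_subject)]

# One toy coefficient draw and the covariate values of subjects falling in
# the subpopulation interval (e.g., between -1 SD and the mean):
beta_draw = [(0.2, 0.5, -0.1, 0.3), (-0.4, 0.1, 0.2, 0.0)]
z_subpop = [-1.0, -0.8, -0.5, -0.2, 0.0]
phi_marg = marginal_probs(beta_draw, z_subpop, T=1)
```

Repeating this averaging for every posterior draw (l) yields a posterior sample of the marginal joint response probabilities for the subpopulation.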

4. Numerical evaluation

The current section presents an evaluation of the performance of the proposed multivariate logistic regression procedure. The goal of the evaluation was twofold; we aimed to demonstrate:

  1. how well the obtained regression coefficients and treatment effects correspond to their true values to examine bias;

  2. how often the decision procedure results in an (in)correct superiority conclusion to learn about decision error rates when sample sizes are estimated a priori.

4.1. Setup

4.1.1. Conditions

The performance of the framework was evaluated in a treatment comparison based on one covariate and two dependent variables. In Appendix D, we present an evaluation of the performance with three dependent variables. Six aspects were varied: the analysis procedure, the effect size, measurement level of the covariate, the correlation between dependent variables, the (sub)population, and the decision rule. Each of these factors will be discussed in the following paragraphs.

4.1.1.1. Analysis procedure

We present three Bayesian analysis procedures:

(1) Multivariate logistic regression analysis (mLR): We analyzed the generated data via the proposed Bayesian multivariate logistic regression model presented in Section 2.

The performance of the mLR-model was compared to two reference approaches:

(2) Multivariate Bernoulli analysis (mB): To demonstrate the gain of a multivariate regression approach over multivariate subgroup analysis (i.e., multivariate stratified analysis), we also fitted the unconditional Bayesian multivariate Bernoulli model of Kavelaars et al. (Citation2020) to the data. Although the multivariate Bernoulli model takes the correlation between dependent variables into account, it computes conditional average treatment effects via stratified multivariate analysis: it only uses the response data from observations that belong to the (sub)population of interest. Hence, the estimation of ATEs uses the full sample of response data, whereas CATEs are estimated from a subsample of response data. Samples of treatment-specific joint response probabilities ϕT could be drawn directly from a posterior Dirichlet distribution with parameters αTn = α0 + {∑i=1n I(Ti=T)I(yi=Hq·)}q=1Q, where α0 is a vector of Q prior hyperparameters.

(3) Univariate logistic regression (uLR): To demonstrate the added value of a multivariate model over multiple univariate models, we fitted Bayesian univariate logistic regression models from Polson et al. (Citation2013) to the individual dependent variables for the scenario with two dependent variables. This univariate model is a special case of the multivariate model presented in Section 2 and Appendix A. While these regression-based models use the full sample of data to estimate conditional average treatment effects among subpopulations, they cannot capture correlations between dependent variables.

4.1.1.2. Data-generating mechanisms: effect size, measurement level of covariate, and correlation

We included treatment differences of four different sizes that varied in heterogeneity:

  1. Effect size 1.1 & 1.2: A homogeneous treatment effect, with average and conditional treatment differences of zero. This scenario aims to demonstrate the Type I error rate under a least favorable treatment difference for the Any and Compensatory rules in the trial as well as the subpopulation.

  2. Effect size 2.1 & 2.2: A heterogeneous treatment effect, with an average treatment difference of zero and a conditional treatment effect larger than zero.

  3. Effect size 3.1 & 3.2: A heterogeneous treatment effect, with one average and both conditional treatment differences larger than zero. The conditional treatment difference is larger than the average treatment difference. The effect size is chosen to compare the power of different methods when the sample size does not lead to underpowered decisions for any of the approaches to estimating conditional average treatment effects. The effect size of the conditional average treatment effect reflects the least favorable average treatment effect for a right-sided test of the All rule and should result in a Type I error rate equal to the chosen level of α if the sample size is sufficiently large.

  4. Effect size 4.1 & 4.2: A heterogeneous treatment effect, with one average and both conditional treatment differences larger than zero. The conditional treatment difference is smaller than the average treatment effect. The effect size is chosen such that the expected sample size after stratification of the study sample is smaller than the sample required for evaluation of the conditional treatment effect; it aims to investigate the statistical power of regression-based methods when stratification leads to underpowered decisions. Similar to effect size 3.1/3.2, the effect size of the conditional average treatment effect reflects the least favorable effect for a right-sided test of the All rule and should result in a Type I error rate equal to the chosen level of α if the sample size is sufficiently large.

For each of these four effect sizes, we varied the measurement level of the covariate and created a model with a binary covariate and a model with a continuous covariate. Further, we specified three pairwise correlations for the dependent variables: a negative correlation (ρθk,θl = −.20), no correlation (ρθk,θl = .00), and a positive correlation (ρθk,θl = .20). These pairwise correlations were identical for all dependent variable pairs and were specified for the conditional average treatment effects (at x = 0 and x = 1 for the dichotomous covariate; at x = −1 and x = 1 for the continuous covariate). The correlation structures and effect sizes of the conditional average treatment effects determine, together with the probability distribution of the covariates, the correlation and effect size of average treatment effects.

These four effect sizes, two measurement levels of the covariate, and three correlation structures resulted in the 4×2×3=24 data generating mechanisms (DGMs) presented in .

4.1.1.3. Treatment effects and (sub)populations

We estimated three different treatment effects:

  1. An average treatment effect (ATE) among the trial population. The trial population with a discrete covariate was defined by a binomially distributed covariate with a probability of 0.50. The trial population with a continuous covariate was defined by a covariate that followed a standard normal distribution.

  2. A conditional average treatment effect (CATE) among a subpopulation defined by an interval of a continuous covariate. This treatment effect was estimated among patients scoring low on the covariate, with the subpopulation defined as all values between the mean and one standard deviation below the mean. Note that the discrete covariate could not be assigned an interval, since subsetting a binary variable inherently results in a single value.

  3. A conditional average treatment effect (CATE) among a subpopulation defined by a fixed value of a covariate. The treatment effect was estimated among patients scoring low on the covariate and was described by a value of 0 (discrete covariate) or 1 (continuous covariate).

4.1.1.4. Decision rules and sample size

We applied the three decision rules from Subsection 4.1.2:

  1. Any rule

  2. All rule

  3. Compensatory rule with unequal weights (w=(0.75,0.25))

We computed sample sizes per treatment group via the procedures from Subsection 2.4 for conditions with non-zero true average treatment effects, targeting a power of 0.80 and a right-sided α of 0.05. If the true average treatment difference was equal to zero, we used n=1,000 per treatment group. The sample size for the average treatment effect was thus leading for the analysis of both average and conditional average treatment effects. As a result, the power for conditional treatment effects was not targeted at 0.80, but should exceed this target when the required sample size for a CATE was smaller than the sample size for the ATE. Similarly, the power of CATEs with a required sample size larger than the ATE sample size should be lower than .80. The required sample sizes are presented in . In these tables, we also included a) the required sample size for the conditional average treatment effect in the subpopulation; and b) the sample size after stratification of the trial population. The sample size after stratification is the expected size in subpopulation analysis of a) response data in a stratified analysis approach; and b) covariate data in empirical marginalization.

4.1.2. Procedure

4.1.2.1. Data generation

For each data generating mechanism and each unique (decision-rule specific) sample size, we sampled 1000 datasets. We generated one covariate z and included an interaction between the treatment and the covariate as well, resulting in the following linear predictor ψiq: (22) ψiq(xi) = β0q + βTqTi + β1qzi + β2qziTi.

To generate response data, we first applied the multinomial logistic link function (Equation 2) to each true linear predictor ψi(xi) to obtain joint response probabilities ϕi(xi) for each subject i. Next, we sampled response vector yi|xi from a multinomial distribution with probabilities ϕi(xi).
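
A minimal sketch of this data-generating step (Python rather than the R used in the study; the true coefficient values, the seed, and the helper name are invented for illustration):

```python
import math
import random

def draw_response(beta_true, z, T, rng):
    """Sample one multinomial response vector y_i given covariate and arm."""
    psi = [b0 + bT * T + b1 * z + b2 * z * T for (b0, bT, b1, b2) in beta_true]
    psi.append(0.0)                      # reference category
    expn = [math.exp(p) for p in psi]
    total = sum(expn)
    phi = [e / total for e in expn]      # joint response probabilities
    u, cum = rng.random(), 0.0
    for q, p in enumerate(phi):          # inverse-CDF multinomial draw
        cum += p
        if u < cum:
            return [int(j == q) for j in range(len(phi))]
    return [int(j == len(phi) - 1) for j in range(len(phi))]

rng = random.Random(42)
beta_true = [(0.2, 0.4, -0.3, 0.2), (-0.1, 0.0, 0.1, 0.3)]
# 100 subjects, standard-normal covariate, alternating treatment assignment:
data = [draw_response(beta_true, z=rng.gauss(0.0, 1.0), T=i % 2, rng=rng)
        for i in range(100)]
```

Each simulated response is a one-hot vector over the joint response categories, mirroring the multinomial sampling described above.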

4.1.2.2. Prior distribution

We specified diffuse prior distributions, motivated by the idea that the obtained posterior distributions are then based entirely on the information in the data, which is a common choice in default Bayesian analyses. For the multivariate logistic regression analysis, we set a multivariate normally distributed prior with means bq=0 and variance matrix B0q=diag(10,…,10) for all regression coefficients. Prior covariances between regression coefficients were set at zero, implying that regression coefficients were independent a priori. For the univariate logistic regression analysis, we used univariate normally distributed priors with a mean of 0 and a variance of 10 for all parameters. The specified variance parameters of regression coefficients were motivated by the work of Gelman et al. (Citation2008), who recommend choosing a variance parameter that results in realistic support for the probability parameter after non-linear transformations in logistic regression and provides sufficient information to stabilize posterior computations. For the mB reference approach, we used a Dirichlet prior distribution with hyperparameters α0=0.01. This prior is close to the improper Haldane prior (α=0), which is considered the least informative prior distribution for bi- or multinomially distributed data, results in a posterior mean equal to the maximum likelihood estimator, and corresponds to a uniform prior on the log-odds scale (Tuyl et al., Citation2008; Kerman, Citation2011). The small deviation from the Haldane prior makes the prior distribution proper and ensures that cell probabilities can be sampled from the Dirichlet distribution when cells have no observations (Kavelaars et al., Citation2020).
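
For the mB reference approach, drawing cell probabilities from the posterior Dirichlet can be sketched as follows (a Python illustration using the standard normalized-Gamma construction; the cell counts are invented, and the study itself used R's MCMCpack for Dirichlet draws):

```python
import random

def dirichlet_draw(alpha, rng):
    """Draw from a Dirichlet distribution via normalized Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]

rng = random.Random(1)
alpha0 = 0.01                        # near-Haldane prior hyperparameter
counts = [17, 5, 9, 3]               # invented cell counts, Q = 4 categories
alpha_post = [alpha0 + n for n in counts]
phi = dirichlet_draw(alpha_post, rng)
```

Because α0 > 0, every posterior hyperparameter is strictly positive, so a draw is possible even for cells without observations.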

4.1.2.3. Gibbs sampling

The regression coefficients in response categories 1,…,(Q−1) were estimated via the Gibbs sampler detailed in Appendix A. We ran two MCMC chains with L=10,000 iterations plus 1,000 burn-in iterations. We visually inspected traceplots of the MCMC chains and used multivariate Gelman-Rubin convergence diagnostics to assess convergence (Gelman & Rubin, Citation1992; Brooks & Gelman, Citation1998). As these traceplots showed satisfactory overlap between chains and the convergence diagnostics were all between 1.00 and 1.10, we concluded that there were no issues with convergence.

4.1.2.4. Transformation and decision-making

We applied the procedures from Subsections 2.2 and 2.3 to arrive at a decision. In marginalization, we included the selection of subjects that belonged to the subpopulation. We performed a right-sided (superiority) test aiming at a Type I error rate of α=.05. We used a decision threshold pcut = 1 − α = 0.95 (Compensatory and All rules) and a multiplicity-corrected pcut = 1 − α/K = 0.975 (Any rule) (Marsman & Wagenmakers, Citation2016; Kavelaars et al., Citation2020; Sozu et al., Citation2016).
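
The decision step can be sketched as follows (a Python illustration with a handful of invented posterior draws of the treatment differences; in practice the posterior sample is much larger):

```python
# Invented posterior draws of the treatment differences delta_k (K = 2),
# one tuple per draw; a real posterior sample would contain thousands.
delta_draws = [(0.05, 0.02), (0.08, -0.01), (0.04, 0.03), (0.06, 0.01)]
alpha, K = 0.05, 2
w = (0.75, 0.25)                     # Compensatory rule weights

L = len(delta_draws)
# Posterior probabilities that each difference (or combination) exceeds zero:
p_each = [sum(d[k] > 0 for d in delta_draws) / L for k in range(K)]
p_all = sum(all(dk > 0 for dk in d) for d in delta_draws) / L
p_comp = sum(sum(wk * dk for wk, dk in zip(w, d)) > 0 for d in delta_draws) / L

any_superior = any(p > 1 - alpha / K for p in p_each)   # corrected threshold
all_superior = p_all > 1 - alpha
comp_superior = p_comp > 1 - alpha
```

Superiority is concluded when the relevant posterior probability exceeds the decision threshold, with the Any rule using the multiplicity-corrected threshold 1 − α/K.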

4.1.3. Software

We conducted our analyses in R (R Core Team, Citation2020). We drew variables from the multivariate normal, Pólya-Gamma, and Dirichlet distributions with the MASS, pgdraw, and MCMCpack packages respectively (Venables & Ripley, Citation2002; Makalic & Schmidt, Citation2016; Martin et al., Citation2011). We used the coda package to explore MCMC chains (Plummer et al., Citation2006). The simulation procedure was parallelized using the foreach and doParallel packages (Microsoft & Weston, Citation2020a, Citation2020b). LaTeX tables were created with the xtable package (Dahl et al., Citation2019).

4.2. Results

4.2.1. Bias

Bias of multivariate and weighted treatment differences was negligible (<|.01|) in most conditions, implying that Bayesian multivariate logistic regression analysis was generally able to reproduce true treatment effects. However, the estimation of average treatment effects under effect sizes 4.1 and 4.2 resulted in slightly biased treatment differences for the Any and Compensatory rules. As shown in , these absolute biases ranged up to |0.04|. These biases were produced in both univariate and multivariate logistic regression analysis, but not in multivariate Bernoulli analysis. Conditional average treatment effects were estimated with comparable patterns of bias and a maximum of |0.025|. This bias showed up in conditions with a small sample, which is a well-documented property of logistic regression in general (Nemes et al., Citation2009).

This bias in treatment differences could be traced back to bias in regression coefficients. Mean estimates of regression coefficients were asymptotically unbiased, implying that bias was negligible (<|0.01|) in conditions with a sufficiently large sample. We observed some bias in conditions with smaller samples (ES 3.1, 3.2, 4.1, and 4.2 under the Any and Compensatory decision rules). We can conclude that bias in regression coefficients was not necessarily problematic for our actual parameters of interest, namely success probabilities and the differences between them. Even when regression coefficients had a small bias (<|0.20| on the log-odds scale), success probabilities and treatment differences could be estimated without bias (<|0.01|), similar to the conditions without biased regression coefficients. This was the case for ES 3.1 and 3.2 under sample sizes of the Any and Compensatory rules. Only more severe bias in regression coefficients (up to |0.57| on the log-odds scale) in conditions with smaller sample sizes was not fully corrected in the transformation steps. This was seen in ES 4.1 and 4.2 under sample sizes of the Any and Compensatory rules.

4.2.2. Decision error rates

4.2.2.1. Average treatment effects

Probabilities to conclude superiority of average treatment effects are presented in . Decisions resulted in appropriate Type I error rates around 0.05 for each of the posterior distribution types under a least favorable scenario of no effect (i.e., ES 1.1, 1.2, 2.1, 2.2 of Any and Compensatory rules) and the proportions of correct superiority conclusions (i.e., power) were close to the targeted 0.80 under a priori estimated sample sizes when the true effect was larger than zero (i.e., ES 3.1, 3.2, 4.1, 4.2 of Any and Compensatory rules). These results showcase that a priori computed sample sizes result in adequate statistical decisions.

In general, multivariate logistic regression (mLR) performed comparably to stratified multivariate analysis (mB) in the estimation of average treatment effects: Type I error rates of mB were around 0.05 and statistical power was close to the targeted 0.80 as well. Compared to univariate logistic regression analysis (uLR), the statistical power of multivariate logistic regression (mLR) appeared less sensitive to the correlation of the data. Effect sizes 3.1/3.2 and 4.1/4.2 under the Compensatory rule demonstrate most clearly how the power of uLR increased when the correlation moved from negative to positive, with uncorrelated data reaching the targeted .80. For these conditions, the sample size on which the uLR model was fitted was smaller and larger, respectively, than needed for an analysis that assumes uncorrelated data. The difference between uLR and mLR was relatively subtle, however, which is in line with the pattern of required sample sizes in . This table shows that differences in required sample sizes for different correlations were relatively small under most data-generating mechanisms. This implies that the effect of using an incorrect sample size on statistical power is relatively limited under the data-generating mechanisms in the simulation study, in contrast with the scenarios presented in .

4.2.2.2. Conditional average treatment effects

The results for conditional treatment effects in the subpopulations are presented in . Similar to average treatment effects, Type I error rates were around the targeted 0.05 under the least favorable scenarios of no effect (ES 1.1, 1.2 for the Any and Compensatory rules) for all estimation methods. The proportion of correct superiority conclusions was above 0.80 in all scenarios with a sample size exceeding the required sample size for CATEs. In the scenarios where the sample size for CATEs was lower than required (4.1 and 4.2 for the Any and Compensatory rules and 2.2 and 3.2 for the All rule), the power was below 0.80.

A comparison of estimation methods revealed that multivariate logistic regression (mLR) was generally more powerful than the stratified multivariate analysis (mB) approach when the covariate was continuous. These effects are prominent in ES 2.2 and 3.2 (All rule) as well as ES 4.2 (Any and Compensatory rules). The statistical power of stratified multivariate analysis (mB) and multivariate logistic regression analysis (mLR) did not differ for the discrete covariate, as demonstrated under ES 2.1 and 3.1 (All rule) as well as ES 4.1 (Any and Compensatory rules).

5. Illustration

We applied the proposed method to a subset of data from the n=19,435 subjects of the International Stroke Trial (International Stroke Trial Collaborative Group, Citation1997). We selected participants who were alive after six months and were treated with either a combined treatment (Aspirin + medium / high-dose Heparin) or a single treatment (Aspirin only), resulting in a sample of n=5,657 participants, of which nH+A=1,859 were in the Heparin + Aspirin group (treatment = 1) and nA=3,798 were in the Aspirin group (treatment = 0). We fitted the model in Equation 6 to compare the effects of the two treatments on a) recurrent stroke within 14 days (0 = no; 1 = yes) and b) dependency after six months (0 = no, 1 = yes) while taking the systolic blood pressure of the subjects (Bp) into account.

5.1. Method

We applied the two procedures from Subsection 3 (fixed values and interval of the covariate) to assess the multivariate and weighted treatment differences in three different types of patient populations:

  1. Average treatment effects in the trial population;

  2. Conditional treatment effects in populations defined by a fixed value. Patient populations were defined by six different values of blood pressure, specifically 1, 2, and 3 standard deviations below and above the mean.

  3. Conditional treatment effects in populations defined by an interval. Patient populations were defined by two different regions of blood pressure: Bp < −1 SD (Low) and Bp > 1 SD (High).

Similar to the Numerical evaluation, we specified a diffuse multivariate normally distributed prior with means bq=0 and variance matrix B0=diag(10,…,10) for all regression coefficients, except the reference category (strk=0, dep=0). Prior covariances between regression coefficients were set at zero, implying that regression coefficients were independent a priori. We ran three MCMC chains via our proposed Gibbs sampler with 20,000 iterations plus 10,000 burn-in iterations. Similar to the simulation study, we used traceplots and multivariate Gelman-Rubin convergence diagnostics to assess convergence (Gelman & Rubin, Citation1992; Brooks & Gelman, Citation1998). Traceplots () showed that chains mixed properly and the multivariate Gelman-Rubin convergence statistic had a value of 1.000, implying that there were no signals of non-convergence.

We performed two-sided tests for the All, Any, and Compensatory rules. For the Compensatory rule, we assumed that long-term impaired functioning is more important than short-term complications and specified weights w=(0.25,0.75) for recurring stroke within 14 days and dependency at 6 months, respectively. These weights implied that the long-term outcome was three times more relevant for the decision than the short-term outcome. Since θT reflects failure probabilities rather than success probabilities, the treatment was considered superior when there was sufficient evidence that the treatment difference of interest was smaller than zero, while inferiority was concluded when the treatment difference of interest was larger than zero. The two-sided test with a targeted Type I error rate of α=0.05 was performed with a decision threshold pcut = 1 − α/2 = 0.975 (Compensatory and All rules) and a multiplicity-corrected pcut = 1 − α/(2K) = 0.9875 (Any rule).
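
A sketch of the two-sided decision for the Compensatory rule (Python; the posterior draws of the weighted difference are invented and far fewer than in the actual analysis):

```python
# Invented posterior draws of the weighted failure-probability difference
# delta_w = w . (theta_{H+A} - theta_A); negative values favor the combined
# treatment because theta holds failure probabilities.
delta_w = [-0.021, -0.015, 0.004, -0.018, -0.009, -0.025, -0.011, -0.017]
alpha = 0.05
p_cut = 1 - alpha / 2                # two-sided threshold, 0.975

L = len(delta_w)
p_lower = sum(d < 0 for d in delta_w) / L   # evidence for superiority
p_upper = sum(d > 0 for d in delta_w) / L   # evidence for inferiority

superior = p_lower > p_cut           # 7/8 = 0.875: no decision here
inferior = p_upper > p_cut
```

With neither posterior probability exceeding the threshold, the two-sided test remains inconclusive, which is the outcome reported for the average treatment effect in this illustration.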

5.2. Results

Results are presented in for different intervals and in for fixed values of blood pressure. Among the trial population, the regression-based and reference approaches resulted in similar treatment difference estimates and posterior probabilities. Treatment differences were close to zero and each of the decision rules resulted in the conclusion that it did not matter whether Aspirin was administered alone or in combination with Heparin.

These average treatment effects gave a limited impression of the efficacy of Aspirin and Heparin, since a picture of heterogeneous treatment effects emerged when conditional treatment effects among subpopulations were considered separately. As opposed to Aspirin only, the combination of Aspirin and Heparin showed a trend toward higher failure probabilities on both dependent variables for patients with a lower blood pressure, while failure probabilities were generally lower among patients with a higher blood pressure.

A visual comparison of multivariate logistic regression (mLR) and stratified multivariate analysis (mB) of response data showed relatively similar estimates and posterior probabilities in the center of the distribution of blood pressure (e.g., between −1 SD and +1 SD), but the stratified approach deviated from the regression-based approach in the tails. Point estimates of treatment differences demonstrated a less stable relation between blood pressure and treatment differences after stratification, as shown in . If the regression-based approach is flexible enough to properly model the effects over the full support of blood pressure, the different behavior in the tails of the covariate distribution might be explained by the smaller sample size after stratification, as implied by the larger error bars.

6. Discussion

The current paper proposed a novel Bayesian multivariate logistic regression framework for analysis and decision-making with multiple correlated dependent variables. The framework is suitable to capture treatment heterogeneity among (groups of) patients that are distinguishable by observed covariate information (i.e., conditional average treatment effects) and to estimate overall treatment effects among the full population (i.e., average treatment effects) under a wide range of scenarios. In general, the proposed regression models were able to reproduce point estimates of average and conditional treatment differences correctly and resulted in decisions with anticipated error rates among the trial population and among subpopulations – as long as the sample was sufficiently large. Further, anticipated decision error rates were found under a priori sample size estimation for different correlation structures (namely negatively correlated, uncorrelated, and positively correlated dependent variables) and for two- and three-dimensional dependent variables. The illustration with the International Stroke Dataset demonstrated how conditional average treatment effects could provide a more in-depth understanding of results beyond average treatment effects.

Compared to other approaches, the Bayesian multivariate logistic regression framework showed favorable properties. Decisions were more powerful than those obtained by multivariate stratified analysis when covariates were continuous, since they were based on information from the full sample rather than a subsample. Moreover, the Bayesian multivariate logistic regression model was more effective in targeting statistical power compared to multiple univariate logistic regression analyses when the correlation between dependent variables was non-zero. Whereas these effects were relatively subtle in the simulation study, the illustrative example in Section 2.4.3 showcased that they are more prominent when correlations are further from zero.

An advantage of the proposed multivariate logistic regression approach is its flexibility to model multivariate treatment effects with correlation structures that are free to vary over covariates, supporting accurate decision error rates and a priori sample size computations. This flexibility comes with additional parameters compared to other multivariate logistic models for correlated binary dependent variables (e.g., Malik & Abraham, Citation1973; O'Brien & Dunson, Citation2004) and may result in computational issues when the number of parameters becomes too high. The Gibbs sampling procedure may become unstable when the sample size is too small relative to the number of parameters, although weakly informative priors may help stabilize computations (Gelman et al., Citation2008). Therefore, the model is most suitable for a limited number of dependent variables and (continuous) covariates.

In practice, researchers are encouraged to consider model assumptions in real data. Additional efforts may be undertaken to verify that the chosen generalized linear model fits the data well enough. If the assumption of linearity on the log-odds scale does not hold, the modeling procedure may benefit from generalization to methods that are more flexible with respect to this assumption, such as (penalized) splines. Again, increased flexibility increases the number of parameters and should be balanced with a) the general risk of overfitting; and b) computational challenges as outlined above. In a more general sense, the researcher should determine which type of flexibility is most appropriate for the research question and data at hand. Further, researchers who aim to target decision error rates have to decide which treatment effect should be leading in the actual choice of sample size. Under treatment heterogeneity, average and (multiple) conditional average treatment effects have different effect sizes by definition, resulting in different sample sizes and raising the question which considerations meaningfully guide this choice.

Theoretically, the framework lends itself to use under a much wider range of scenarios than showcased in this paper. Each of the elements – modeling, transformation, decision-making – can be replaced by an alternative, resulting in a large number of variations. Some variations, such as a less computationally intensive analysis model, a wider range of prior distributions, and interim monitoring as an alternative to decision-making with a priori estimated sample sizes, have been presented already (Kavelaars et al., Citation2020). Here, we mention two additional suggestions to elaborate the framework. First, in addition to the presented transformations to success probabilities and treatment differences, transformations to other associations between treatment and outcome, such as odds ratios and risk ratios, may be of interest and are worth investigating. Second, hypotheses other than superiority and inferiority, such as non-inferiority or equality decision-making, can also be relevant to include in the framework (see Van Ravenzwaaij et al., Citation2019 for a discussion). More flexible formulations of hypotheses and another perspective on the assessment of evidence can be achieved via the computation of Bayes factors (see for an introduction e.g., Mulder & Wagenmakers, Citation2016).

Other than the abovementioned variations, several directions for future research follow naturally from the current results. First, the procedure theoretically lends itself to out-of-sample prediction for populations within or beyond the covariate range of the trial population. The robustness of the framework in such applications remains to be investigated and may include evaluations of model fit.

Second, research might shed light on further sample size considerations. The current paper provided tools to compute required sample sizes and to control decision error rates, provided that researchers can estimate effect sizes with reasonable accuracy prior to the study and that sample sizes are sufficiently large. When sample sizes were relatively small, bias was introduced. In line with our observations, small-sample bias in regression coefficients is a well-documented property of nonlinear regression methods in general (Firth, Citation1993; Nemes et al., Citation2009). Although some of the bias in regression coefficients disappeared during transformation to joint response probabilities, success probabilities, and treatment differences, the mechanism is not yet fully understood. Hence, further study of the circumstances under which distributional properties are inherited through the (non-linear) multinomial logistic transformation could yield more elaborate insights into the minimum number of observations required for satisfactory model performance. Larger effect sizes (implying smaller required sample sizes), model complexity (i.e., the number of parameters), and the number of events per variable are candidate factors that may interact in their effects on model performance in small samples (De Jong et al., Citation2019). There is no short answer here, but in practice power among different subpopulations might be balanced against the number of subjects a researcher is willing or able to include in the trial. Optimum sample sizes in these regression-based decision approaches therefore remain to be investigated more elaborately.

Another interesting direction for future research would be to extend the proposed multivariate logistic regression model with (discrete or continuous) latent variables to capture unexplained heterogeneity in the estimation of average and conditional average treatment effects and in decision-making. This extension falls outside the scope of the current paper, which focuses on modeling treatment heterogeneity caused by observed covariate information.

Lastly, causal inference is less straightforward in (stratified) subgroup analysis, as conditioning upon covariates might interfere with randomization (European Medicines Agency, Citation2019; Food & Drug Administration, Citation2019). Claims about causal relationships might require additional checking of assumptions; the tutorials by Hoogland et al. (Citation2021) and Lipkovich et al. (Citation2016) may be of help here.

Article information

Conflict of Interest Disclosures: Each author signed a form for disclosure of potential conflicts of interest. No authors reported any financial or other conflicts of interest in relation to the work described.

Ethical Principles: The authors affirm having followed professional ethical guidelines in preparing this work. These guidelines include obtaining informed consent from human participants, maintaining ethical treatment and respect for the rights of human or animal participants, and ensuring the privacy of participants and their data, such as ensuring that individual participants cannot be identified in reported results or from publicly available original or archival data.

Funding: This work was supported by Grant 406.18.505 from the Dutch Research Council.

Role of the Funders/Sponsors: None of the funders or sponsors of this research had any role in the design and conduct of the study; collection, management, analysis, and interpretation of data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.

Acknowledgements: The authors would like to thank the anonymous reviewers for their comments on prior versions of this manuscript. The ideas and opinions expressed herein are those of the authors alone, and endorsement by the authors’ institutions or the funding agency is not intended and should not be inferred. Also, we thank The International Stroke Trial Collaborative Group for making the data from the second International Stroke Trial publicly available.

Supplemental material


Notes

1 If this is the case, (numerical) integration can be an alternative to interpolate the conditional treatment effect distribution of interest.

References

  • Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4), 434–455.
  • Chen, M.-H., & Ibrahim, J. G. (2000). Power prior distributions for regression models. Statistical Science, 15(1), 46–60. https://doi.org/10.1214/ss/1009212673
  • Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432), 1313–1321. https://doi.org/10.1080/01621459.1995.10476635
  • Chow, S.-C., Shao, J., Wang, H., & Lokhnygina, Y. (2017). Sample size calculations in clinical research (3rd ed.). Chapman and Hall/CRC.
  • Chuang-Stein, C., Stryszak, P., Dmitrienko, A., & Offen, W. (2006). Challenge of multiple co-primary endpoints: A new approach. Statistics in Medicine, 26(6), 1181–1192. https://doi.org/10.1002/sim.2604
  • Dahl, D. B., Scott, D., Roosen, C., Magnusson, A., & Swinton, J. (2019). xtable: Export tables to LaTeX or HTML. R package version 1.8-4.
  • Dai, B., Ding, S., & Wahba, G. (2013). Multivariate Bernoulli distribution. Bernoulli, 19(4), 1465–1483. https://doi.org/10.3150/12-BEJSP10
  • De Jong, V. M. T., Eijkemans, M. J. C., Calster, B., Timmerman, D., Moons, K. G. M., Steyerberg, E. W., & Smeden, M. (2019). Sample size considerations and predictive performance of multinomial logistic prediction models. Statistics in Medicine, 38(9), 1601–1619. https://doi.org/10.1002/sim.8063
  • European Medicines Agency. (2019). Guideline on the investigation of subgroups in confirmatory clinical trials.
  • Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1), 27–38. https://doi.org/10.1093/biomet/80.1.27
  • Food and Drug Administration (2019). Enrichment strategies for clinical trials to support determination of effectiveness of human drugs and biological products: Guidance for industry. Center for Biologics Evaluation and Research (CBER).
  • Food and Drug Administration. (2016). Non-inferiority clinical trials to establish effectiveness: Guidance for industry.
  • Food and Drug Administration. (2017). Multiple endpoints in clinical trials guidance for industry.Center for Biologics Evaluation and Research (CBER).
  • Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360–1383. https://doi.org/10.1214/08-AOAS191
  • Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472. https://doi.org/10.1214/ss/1177011136
  • Goldberger, J. J., & Buxton, A. E. (2013). Personalized medicine vs guideline-based medicine. JAMA, 309(24), 2559–2560. https://doi.org/10.1001/jama.2013.6629
  • Hamburg, M. A., & Collins, F. S. (2010). The path to personalized medicine. The New England Journal of Medicine, 363(4), 301–304. https://doi.org/10.1056/NEJMp1006304
  • Hoogland, J., IntHout, J., Belias, M., Rovers, M. M., Riley, R. D., E Harrell, F., Moons, K. G. M., Debray, T. P. A., & Reitsma, J. B. (2021). A tutorial on individualized treatment effect prediction from randomized trials with a binary endpoint. Statistics in Medicine, 40(26), 5961–5981. https://doi.org/10.1002/sim.9154
  • International Stroke Trial Collaborative Group. (1997). The international stroke trial (IST): A randomised trial of aspirin, subcutaneous heparin, both, or neither among 19,435 patients with acute ischaemic stroke. The Lancet, 349(9065), 1569–1581.
  • Jones, H. E., Ohlssen, D. I., Neuenschwander, B., Racine, A., & Branson, M. (2011). Bayesian models for subgroup analysis in clinical trials. Clinical Trials (London, England), 8(2), 129–143. https://doi.org/10.1177/1740774510396933
  • Kavelaars, X., Mulder, J., & Kaptein, M. (2020). Decision-making with multiple correlated binary outcomes in clinical trials. Statistical Methods in Medical Research, 29(11), 3265–3277. https://doi.org/10.1177/0962280220922256
  • Kerman, J. (2011). Neutral noninformative and informative conjugate beta and gamma prior distributions. Electronic Journal of Statistics, 5, 1450–1470. https://doi.org/10.1214/11-EJS648
  • Lipkovich, I., Dmitrienko, A., & D’Agostino, R. B. (2016). Tutorial in biostatistics: Data-driven subgroup identification and analysis in clinical trials. Statistics in Medicine, 36(1), 136–196. https://doi.org/10.1002/sim.7064
  • Makalic, E., & Schmidt, D. (2016). High-dimensional Bayesian regularised regression with the bayesreg package. arXiv:1611.06649v3.
  • Malik, H. J., & Abraham, B. (1973). Multivariate logistic distributions. The Annals of Statistics, 1(3), 588–590. https://doi.org/10.1214/aos/1176342430
  • Marsman, M., & Wagenmakers, E.-J. (2016). Three insights from a Bayesian interpretation of the one-sided p value. Educational and Psychological Measurement, 77(3), 529–539. https://doi.org/10.1177/0013164416669201
  • Martin, A. D., Quinn, K. M., & Park, J. H. (2011). MCMCpack: Markov chain Monte Carlo in R. Journal of Statistical Software, 42(9), 22. https://doi.org/10.18637/jss.v042.i09
  • McLachlan, G. J., Lee, S. X., & Rathnayake, S. I. (2019). Finite mixture models. Annual Review of Statistics and Its Application, 6(1), 355–378. https://doi.org/10.1146/annurev-statistics-031017-100325
  • Microsoft, & Weston, S. (2020). doParallel: Foreach parallel adaptor for the ’parallel’ package. R package version 1.0.16.
  • Microsoft, & Weston, S. (2020). foreach: Provides foreach looping construct. R package version 1.5.1.
  • Mulder, J., & Wagenmakers, E.-J. (2016). Editors’ introduction to the special issue “bayes factors for testing hypotheses in psychological research: Practical relevance and new developments”. Journal of Mathematical Psychology, 72, 1–5. https://doi.org/10.1016/j.jmp.2016.01.002
  • Murray, T. A., Thall, P. F., & Yuan, Y. (2016). Utility-based designs for randomized comparative trials with categorical outcomes. Statistics in Medicine, 35(24), 4285–4305. https://doi.org/10.1002/sim.6989
  • Nemes, S., Jonasson, J. M., Genell, A., & Steineck, G. (2009). Bias in odds ratios by logistic regression modelling and sample size. BMC Medical Research Methodology, 9(1), 56. https://doi.org/10.1186/1471-2288-9-56
  • O’Brien, P. C. (1984). Procedures for comparing samples with multiple endpoints. Biometrics, 40(4), 1079–1087.
  • O’Brien, S. M., & Dunson, D. B. (2004). Bayesian multivariate logistic regression. Biometrics, 60(3), 739–746. https://doi.org/10.1111/j.0006-341X.2004.00224.x
  • Olkin, I., & Trikalinos, T. A. (2015). Constructions for a bivariate beta distribution. Statistics & Probability Letters, 96, 54–60. https://doi.org/10.1016/j.spl.2014.09.013
  • Plummer, M., Best, N., Cowles, K., & Vines, K. (2006). Coda: Convergence diagnosis and output analysis for MCMC. R News, 6(1), 7–11.
  • Pocock, S. J. (1997). Clinical trials with multiple outcomes: A statistical perspective on their design, analysis, and interpretation. Controlled Clinical Trials, 18(6), 530–545. https://doi.org/10.1016/s0197-2456(97)00008-1
  • Pocock, S. J., Geller, N. L., & Tsiatis, A. A. (1987). The analysis of multiple endpoints in clinical trials. Biometrics, 43(3), 487. https://doi.org/10.2307/2531989
  • Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504), 1339–1349. https://doi.org/10.1080/01621459.2013.829001
  • Prentice, R. L. (1997). Discussion: On the role and analysis of secondary outcomes in clinical trials. Controlled Clinical Trials, 18(6), 561–567. https://doi.org/10.1016/S0197-2456(96)00105-5
  • R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing.
  • Ristl, R., Urach, S., Rosenkranz, G., & Posch, M. (2019). Methods for the analysis of multiple endpoints in small populations: A review. Journal of Biopharmaceutical Statistics, 29(1), 1–29. https://doi.org/10.1080/10543406.2018.1489402
  • Rossi, P. E., Allenby, G. M., & McCulloch, R. (2005). Bayesian statistics and marketing. John Wiley & Sons.
  • Sandercock, P. A., Niewada, M., & Członkowska, A. (2011). The international stroke trial database. Trials, 12(1), 101. https://doi.org/10.1186/1745-6215-12-101
  • Senn, S., & Bretz, F. (2007). Power and sample size when multiple endpoints are considered. Pharmaceutical Statistics, 6(3), 161–170. https://doi.org/10.1002/pst.301
  • Simon, R. (2010). Clinical trials for predictive medicine: New challenges and paradigms. Clinical Trials (London, England), 7(5), 516–524. https://doi.org/10.1177/1740774510366454
  • Sozu, T., Sugimoto, T., & Hamasaki, T. (2010). Sample size determination in clinical trials with multiple co-primary binary endpoints. Statistics in Medicine, 29(21), 2169–2179. https://doi.org/10.1002/sim.3972
  • Sozu, T., Sugimoto, T., & Hamasaki, T. (2016). Reducing unnecessary measurements in clinical trials with multiple primary endpoints. Journal of Biopharmaceutical Statistics, 26(4), 631–643. https://doi.org/10.1080/10543406.2015.1052497
  • Sullivan, S. G., & Greenland, S. (2012). Bayesian regression in SAS software. International Journal of Epidemiology, 42(1), 308–317. https://doi.org/10.1093/ije/dys213
  • Tang, D.-I., Geller, N. L., & Pocock, S. J. (1993). On the design and analysis of randomized clinical trials with multiple endpoints. Biometrics, 49(1), 23–30. https://doi.org/10.2307/2532599
  • Thall, P. F. (2020). Bayesian cancer clinical trial designs with subgroup-specific decisions. Contemporary Clinical Trials, 90, 105860. https://doi.org/10.1016/j.cct.2019.105860
  • Tuyl, F., Gerlach, R., & Mengersen, K. (2008). A comparison of Bayes-laplace, Jeffreys, and other priors: The case of zero events. The American Statistician, 62(1), 40–44. https://doi.org/10.1198/000313008X267839
  • Van Ravenzwaaij, D., Monden, R., Tendeiro, J. N., & Ioannidis, J. P. A. (2019). Bayes factors for superiority, non-inferiority, and equivalence designs. BMC Medical Research Methodology, 19(1), 71. https://doi.org/10.1186/s12874-019-0699-7
  • Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). Springer.
  • Wang, M., Spiegelman, D., Kuchiba, A., Lochhead, P., Kim, S., Chan, A. T., Poole, E. M., Tamimi, R., Tworoger, S. S., Giovannucci, E., Rosner, B., & Ogino, S. (2015). Statistical methods for studying disease subtype heterogeneity. Statistics in Medicine, 35(5), 782–800. https://doi.org/10.1002/sim.6793
  • Xiong, C., Yu, K., Gao, F., Yan, Y., & Zhang, Z. (2005). Power and sample size for clinical trials when efficacy is required in multiple endpoints: Application to an alzheimer’s treatment trial. Clinical Trials (London, England), 2(5), 387–393. https://doi.org/10.1191/1740774505cn112oa
  • Yang, S., Li, F., Thomas, L. E., & Li, F. (2021). Covariate adjustment in subgroup analyses of randomized clinical trials: A propensity score approach. Clinical Trials (London, England), 18(5), 570–581. https://doi.org/10.1177/17407745211028588
  • Zhao, Y., Grambsch, P. M., & Neaton, J. D. (2007). A decision rule for sequential monitoring of clinical trials with a primary and supportive outcome. Clinical Trials (London, England), 4(2), 140–153. https://doi.org/10.1177/1740774507076936

Appendix A.

Details of posterior computation

The current section describes the Gibbs sampling procedure used to sample the model parameters. To simplify notation, we omit the dependence on x when denoting functions that rely on covariates (e.g., ϕ, θ).

Starting from the likelihood of the individual K-variate response $\mathbf{y}_i$ (Equation 2), the likelihood of $n$ K-variate responses follows from taking the product over the $n$ individual joint response probabilities in $Q$ response categories:

$$l(\mathbf{y}|\boldsymbol{\beta},\mathbf{x}) = \prod_{i=1}^{n}\left[\prod_{q=1}^{Q-1}\left(\frac{\exp[\psi_{iq}]}{\sum_{r=1}^{Q-1}\exp[\psi_{ir}]+1}\right)^{I(y_i=q)}\right]\left(\frac{1}{\sum_{r=1}^{Q-1}\exp[\psi_{ir}]+1}\right)^{I(y_i=Q)} \tag{A1}$$

Following Polson et al. (Citation2013), we introduce the Pólya-Gamma variable by rewriting the multivariate likelihood in Equation A1 as a series of binomial likelihoods. The likelihood of $\mathbf{y}$ conditional on the parameters of the $q$th response category, $\boldsymbol{\beta}_q$, then equals

$$l(\mathbf{y}|\boldsymbol{\beta}_{q},\boldsymbol{\beta}_{-q}) = \prod_{i=1}^{n}\left(\frac{\exp[\eta_{iq}]}{\exp[\eta_{iq}]+1}\right)^{I(y_i=q)}\left(\frac{1}{\exp[\eta_{iq}]+1}\right)^{1-I(y_i=q)} \tag{A2}$$

where $-q$ refers to all rows in $H$ not having index $q$ and $\eta_{iq}=\psi_{iq}-\ln\left(\sum_{m\in H_{-q}}\exp[\psi_{im}]\right)$.

The Pólya-Gamma transformation to a Gaussian distribution relies on the following equality (Polson et al., Citation2013):

$$\frac{(\exp[\eta_{iq}])^{I(y_i=q)}}{\exp[\eta_{iq}]+1} = 2^{-1}\exp\!\left[\left(I(y_i=q)-\tfrac{1}{2}\right)\eta_{iq}\right]\int_{0}^{\infty}\exp\!\left[-\frac{\omega_{iq}\,\eta_{iq}^{2}}{2}\right]p(\omega_{iq})\,d\omega_{iq} \tag{A3}$$

where $\omega_{iq}$ has a Pólya-Gamma distribution, i.e., $p(\omega_{iq}) \sim PG(1,0)$.

If we use the equality in Equation A3, the binomial likelihood in Equation A2 can be transformed into a multivariate Gaussian likelihood by including an auxiliary Pólya-Gamma variable $\omega_{iq}$ (Polson et al., Citation2013):

$$\begin{aligned} l(\mathbf{y}|\boldsymbol{\beta}_{q},\boldsymbol{\beta}_{-q}) &= \prod_{i=1}^{n}\frac{(\exp[\eta_{iq}])^{I(y_i=q)}}{\exp[\eta_{iq}]+1} \\ &= \prod_{i=1}^{n} 2^{-1}\exp\!\left[\left(I(y_i=q)-\tfrac{1}{2}\right)\eta_{iq}\right]\int_{0}^{\infty}\exp\!\left[-\frac{\omega_{iq}\,\eta_{iq}^{2}}{2}\right]p(\omega_{iq})\,d\omega_{iq} \\ &\propto \prod_{i=1}^{n}\exp\!\left[\kappa_{iq}\,\omega_{iq}\,\eta_{iq}-\tfrac{1}{2}\,\eta_{iq}^{2}\,\omega_{iq}\right]PG(\omega_{iq}|1,0) \\ &\propto \exp\!\left[-\tfrac{1}{2}\left(\boldsymbol{\kappa}_{q}-\boldsymbol{\eta}_{q}\right)^{T}\boldsymbol{\Omega}_{q}\left(\boldsymbol{\kappa}_{q}-\boldsymbol{\eta}_{q}\right)\right] \\ &= \exp\!\left[-\tfrac{1}{2}\left(\boldsymbol{\kappa}_{q}-X\boldsymbol{\beta}_{q}+\ln\!\left[\sum_{m\neq q}\exp(X\boldsymbol{\beta}_{m})\right]\right)^{T}\boldsymbol{\Omega}_{q}\left(\boldsymbol{\kappa}_{q}-X\boldsymbol{\beta}_{q}+\ln\!\left[\sum_{m\neq q}\exp(X\boldsymbol{\beta}_{m})\right]\right)\right] \end{aligned} \tag{A4}$$

where $\kappa_{iq}=\left(I(y_i=H_q)-\tfrac{1}{2}\right)/\omega_{iq}$, $\boldsymbol{\kappa}_{q}=(\kappa_{1q},\dots,\kappa_{nq})$, $\boldsymbol{\omega}_{q}=(\omega_{1q},\dots,\omega_{nq})$, and $\boldsymbol{\Omega}_{q}=\mathrm{diag}(\boldsymbol{\omega}_{q})$.

A.0.1 Prior distribution

The Gaussian likelihood in Equation A4 is conditionally conjugate with a normal prior distribution on the regression coefficients $\boldsymbol{\beta}_q$:

$$\boldsymbol{\beta}_{q} \sim N(\mathbf{b}_{q}, \mathbf{B}_{0q}) \tag{A5}$$

where $\mathbf{b}_q$ is the vector of prior means of regression coefficient vector $\boldsymbol{\beta}_q$ and $\mathbf{B}_{0q}$ is a $P\times P$ symmetric matrix reflecting the prior covariance of regression coefficients $\boldsymbol{\beta}_q$. A researcher who is willing to include prior information regarding treatment effects in the analysis has several options to specify prior hyperparameters for a normally distributed prior that is compatible with the Gibbs sampling procedure (e.g., Sullivan & Greenland, Citation2012; Chen & Ibrahim, Citation2000). We discuss the specification of informative prior means $\mathbf{b}_q$ in terms of joint response probabilities $\boldsymbol{\phi}$ in Appendix B.

A.0.2 Posterior distribution

Bayesian statistical inference proceeds via the posterior distribution, which is given by

$$p(\boldsymbol{\beta}|\mathbf{y}) \propto p(\mathbf{y}|\boldsymbol{\beta},\mathbf{x})\,p(\boldsymbol{\beta}). \tag{A6}$$

The combination of the Pólya-Gamma transformed Gaussian likelihood (Equation A4) and the normal prior distribution (Equation A5) is proportional to a normal posterior distribution, conditional on the Pólya-Gamma variables in $\boldsymbol{\omega}_q$ (Polson et al., Citation2013):

$$\begin{aligned} p(\boldsymbol{\beta}_{q}|\mathbf{y},\boldsymbol{\Omega}_{q}) &\propto p(\mathbf{y}|\boldsymbol{\beta}_{q},\boldsymbol{\omega}_{q})\,p(\boldsymbol{\beta}_{q}) \\ &\propto \exp\!\left[-\tfrac{1}{2}\left(\boldsymbol{\kappa}_{q}-X\boldsymbol{\beta}_{q}+\ln\!\left[\sum_{m\neq q}\exp[X\boldsymbol{\beta}_{m}]\right]\right)^{T}\boldsymbol{\Omega}_{q}\left(\boldsymbol{\kappa}_{q}-X\boldsymbol{\beta}_{q}+\ln\!\left[\sum_{m\neq q}\exp[X\boldsymbol{\beta}_{m}]\right]\right)\right] \\ &\quad\times \exp\!\left[-\tfrac{1}{2}(\boldsymbol{\beta}_{q}-\mathbf{b}_{q})^{T}\mathbf{B}_{q}^{-1}(\boldsymbol{\beta}_{q}-\mathbf{b}_{q})\right] \\ &\propto N\!\left(\mathbf{V}_{q}\left(X^{T}\boldsymbol{\Omega}_{q}\left(\boldsymbol{\kappa}_{q}+\ln\!\left[\sum_{m\neq q}\exp[X\boldsymbol{\beta}_{m}]\right]\right)+\mathbf{B}_{q}^{-1}\mathbf{b}_{q}\right),\ \mathbf{V}_{q}\right) \end{aligned} \tag{A7}$$

where $\mathbf{V}_{q}=(X^{T}\boldsymbol{\Omega}_{q}X+\mathbf{B}_{q}^{-1})^{-1}$. Similarly, the subject-specific variable $\omega_{iq}$ follows a Pólya-Gamma distribution that depends on the regression coefficients $\boldsymbol{\beta}_q$ via the linear predictor $\psi_{iq}$.

Updating these two conditional distributions via a Gibbs sampling procedure results in a sample from the posterior distribution of β. Specifically, the sampling procedure involves iterating L times over the following two steps for q=1,,Q1, while keeping βQ fixed at zero:

  1. Draw a vector of P+1 regression coefficients $\boldsymbol{\beta}_q|\boldsymbol{\omega}_q$ from a multivariate normal distribution with mean vector $\mathbf{m}_q$ and covariance matrix $\mathbf{V}_q$:

$$\boldsymbol{\beta}_{q}|\boldsymbol{\omega}_{q} \sim N(\mathbf{m}_{q}, \mathbf{V}_{q}) \tag{A8}$$

where

$$\mathbf{V}_{q}^{-1}=X^{T}\boldsymbol{\Omega}_{q}X+\mathbf{B}_{0q}^{-1}, \qquad \mathbf{m}_{q}=\mathbf{V}_{q}\left(X^{T}(\boldsymbol{\kappa}_{q}+\boldsymbol{\Omega}_{q}\mathbf{c})+\mathbf{B}_{0q}^{-1}\mathbf{b}_{q}\right), \qquad \mathbf{c}=\left\{\ln\!\left(\sum_{m\neq q}\exp[\psi_{im}]\right)\right\}_{i=1}^{n}.$$

  2. Sample $\boldsymbol{\omega}_q|\boldsymbol{\beta}_q$ as a vector of $n$ draws $\omega_{iq}|\boldsymbol{\beta}_q$ from a Pólya-Gamma distribution:

$$\omega_{iq}|\boldsymbol{\beta}_{q} \sim PG\!\left(1,\ \psi_{iq}-\ln\sum_{m\neq q}\exp[\psi_{im}]\right). \tag{A9}$$

The Gibbs sampling procedure results in a sample of L sets of regression coefficients from the posterior distribution of β.
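To make the two steps concrete, the loop can be sketched in code. The following is a minimal illustration for the binary special case (Q = 2), where the linear predictor reduces to η_i = x_i′β and the multinomial offset term vanishes; it is not the authors' implementation, and the function names are ours. The Pólya-Gamma draws use the truncated sum-of-gammas representation of Polson et al. (Citation2013), which is an approximation (exact samplers are available, e.g., in the BayesLogit R package):

```python
import numpy as np

rng = np.random.default_rng(2024)

def rpg_approx(c, n_terms=200):
    """Approximate PG(1, c) draws via the truncated sum-of-gammas
    representation of Polson et al. (2013); the density depends on c
    only through |c|, so no sign handling is needed."""
    c = np.asarray(c, dtype=float)
    k = np.arange(1, n_terms + 1)
    g = rng.gamma(1.0, 1.0, size=(c.size, n_terms))
    denom = (k - 0.5) ** 2 + (c[:, None] / (2.0 * np.pi)) ** 2
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)

def gibbs_logistic(X, y, b0, B0_inv, n_iter=400):
    """Two-step Gibbs sampler for binary logistic regression (Q = 2),
    with prior mean b0 and prior precision B0_inv."""
    n, P = X.shape
    beta = np.zeros(P)
    kappa = y - 0.5                      # kappa_i = I(y_i = 1) - 1/2
    draws = np.empty((n_iter, P))
    for l in range(n_iter):
        # Step 2: omega_i | beta ~ PG(1, eta_i) with eta_i = x_i' beta
        omega = rpg_approx(X @ beta)
        # Step 1: beta | omega ~ N(m, V), V^{-1} = X' Omega X + B0^{-1}
        V = np.linalg.inv(X.T @ (omega[:, None] * X) + B0_inv)
        m = V @ (X.T @ kappa + B0_inv @ b0)
        beta = rng.multivariate_normal(m, V)
        draws[l] = beta
    return draws
```

With a weakly informative prior (small prior precision), posterior means from this sketch closely track the maximum likelihood estimates for moderately sized samples.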

Appendix B.

Specification of prior means of regression coefficients

In the current section, we introduce a procedure to determine prior means based on beliefs regarding success probabilities and the correlations between them. We outline the procedure for two outcome variables and a linear predictor $\psi$ with one covariate and an interaction between treatment and covariate:

$$\psi_{Tq} = \beta_{0q} + \beta_{1q}T + \beta_{2q}x + \beta_{3q}\,x \times T \tag{B1}$$

First, choose $x_L$ and $x_H$ as low and high values of covariate $x$, respectively. Next, specify success probabilities and correlations $\theta_T(x_L)$, $\rho_T(x_L)$, $\theta_T(x_H)$, and $\rho_T(x_H)$ for each treatment $T$ that accompany the low and high covariate values. These success probabilities $\theta_T(x_.)$ and correlations $\rho_T(x_.)$ can be transformed to joint response probabilities $\phi_T(x_.)$ via the following set of equations:

$$\begin{aligned} \phi_{T}^{11}(x_.) &= \rho_{T}(x_.)\sqrt{\theta_{T}^{1}(x_.)[1-\theta_{T}^{1}(x_.)]\,\theta_{T}^{2}(x_.)[1-\theta_{T}^{2}(x_.)]}+\theta_{T}^{1}(x_.)\,\theta_{T}^{2}(x_.) \\ \phi_{T}^{10}(x_.) &= \theta_{T}^{1}(x_.)-\phi_{T}^{11}(x_.) \\ \phi_{T}^{01}(x_.) &= \theta_{T}^{2}(x_.)-\phi_{T}^{11}(x_.) \\ \phi_{T}^{00}(x_.) &= 1-\theta_{T}^{1}(x_.)-\theta_{T}^{2}(x_.)+\phi_{T}^{11}(x_.) \end{aligned} \tag{B2}$$
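As a quick sanity check on a specified pair (θ, ρ), the transformation in Equation B2 can be coded directly. The function below is a minimal sketch for two binary outcomes; the function name is ours:

```python
import numpy as np

def joint_probs(theta1, theta2, rho):
    """Equation B2: map marginal success probabilities (theta1, theta2)
    and their correlation rho to the joint response probabilities
    (phi11, phi10, phi01, phi00)."""
    phi11 = rho * np.sqrt(theta1 * (1 - theta1) * theta2 * (1 - theta2)) \
            + theta1 * theta2
    return np.array([phi11,
                     theta1 - phi11,                 # phi10
                     theta2 - phi11,                 # phi01
                     1 - theta1 - theta2 + phi11])   # phi00
```

Valid specifications yield probabilities that are non-negative, sum to one, and reproduce the marginals (e.g., φ11 + φ10 = θ1), which constrains the admissible range of ρ.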

For each response category $q$, the joint response probabilities $\phi_{Tq}$ can be transformed to a linear predictor $\psi_{Tq}$ using the multinomial logistic link function in Equation 2.

Solving these linear predictors for $\boldsymbol{\beta}_q$ results in the following definitions of the elements of $\boldsymbol{\beta}_q$:

$$\begin{aligned} \beta_{0q} &= \frac{x_H\,\psi_{0q}(x_L)-x_L\,\psi_{0q}(x_H)}{x_H-x_L} \\ \beta_{1q} &= \frac{x_H\,[\psi_{1q}(x_L)-\psi_{0q}(x_L)]+x_L\,[\psi_{0q}(x_H)-\psi_{1q}(x_H)]}{x_H-x_L} \\ \beta_{2q} &= \frac{\psi_{0q}(x_H)-\psi_{0q}(x_L)}{x_H-x_L} \\ \beta_{3q} &= \frac{\psi_{1q}(x_H)-\psi_{0q}(x_H)-\psi_{1q}(x_L)+\psi_{0q}(x_L)}{x_H-x_L} \end{aligned} \tag{B3}$$

For example, if we believed the treatments to have the parameters

$$\begin{aligned} \theta_{1}(x_L)&=(0.60, 0.70), & \rho_{1}(x_L)&=0.30, \\ \theta_{1}(x_H)&=(0.40, 0.30), & \rho_{1}(x_H)&=0.30, \\ \theta_{0}(x_L)&=(0.40, 0.30), & \rho_{0}(x_L)&=0.30, \\ \theta_{0}(x_H)&=(0.60, 0.70), & \rho_{0}(x_H)&=0.30, \end{aligned}$$

then the regression coefficients would be as presented in .
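The coefficient recovery in Equation B3 can be verified with a short round-trip: generate linear predictors from known coefficients at the four design points (T, x) ∈ {0, 1} × {x_L, x_H}, then solve back for the coefficients. This is an illustrative sketch and the helper name is ours:

```python
def betas_from_psi(psi0_L, psi0_H, psi1_L, psi1_H, xL, xH):
    """Equation B3: recover (beta0, beta1, beta2, beta3) for one response
    category from the linear predictor evaluated at the design points
    (T, x) in {0, 1} x {xL, xH}."""
    d = xH - xL
    beta0 = (xH * psi0_L - xL * psi0_H) / d
    beta1 = (xH * (psi1_L - psi0_L) + xL * (psi0_H - psi1_H)) / d
    beta2 = (psi0_H - psi0_L) / d
    beta3 = (psi1_H - psi0_H - psi1_L + psi0_L) / d
    return beta0, beta1, beta2, beta3
```

Because the linear predictor in Equation B1 has exactly four free coefficients, the four design points identify them uniquely whenever x_L ≠ x_H.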

Appendix C.

Procedures for estimation and inference over a specified (sub)population

Algorithm 1

Transformation of posterior regression coefficients to posterior joint response probabilities based on fixed covariate values.

Let βQ=(0,,0)

  1. for draw (l)1:L do

  2.  for treatment T0:1 do

  3.   for joint response q1:Q do

  4.    Compute $\psi_{Tq}^{(l)}=\beta_{0q}^{(l)}+\beta_{1q}^{(l)}T+\beta_{2q}^{(l)}x+\beta_{3q}^{(l)}\,x\times T$

  5.    Compute $\phi_{Tq}^{(l)}=\dfrac{\exp[\psi_{Tq}^{(l)}]}{\sum_{r=1}^{Q-1}\exp[\psi_{Tr}^{(l)}]+1}$

  6.    end for

  7.   end for

  8. end for
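The loops of Algorithm 1 can be vectorized over the posterior draws. The sketch below is our own illustration, assuming draws are stored as an (L, Q−1, 4) array with columns (β0, β1, β2, β3) and the reference category Q fixed at zero:

```python
import numpy as np

def transform_draws_fixed_x(beta_draws, x):
    """Algorithm 1, vectorized: joint response probabilities phi for
    T = 0, 1 at a fixed covariate value x, for all L posterior draws.
    beta_draws has shape (L, Q-1, 4); category Q has coefficients zero."""
    L, Qm1, _ = beta_draws.shape
    phi = np.empty((L, 2, Qm1 + 1))
    for T in (0, 1):
        psi = (beta_draws[:, :, 0] + beta_draws[:, :, 1] * T
               + beta_draws[:, :, 2] * x + beta_draws[:, :, 3] * x * T)
        denom = np.exp(psi).sum(axis=1) + 1.0   # "+1" is exp(0), category Q
        phi[:, T, :Qm1] = np.exp(psi) / denom[:, None]
        phi[:, T, Qm1] = 1.0 / denom
    return phi
```

Posterior summaries of φ (and of success probabilities or treatment differences derived from it) then follow directly from the resulting (L, 2, Q) array of transformed draws.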

Algorithm 2

Transformation of posterior regression coefficients to posterior joint response probabilities based on empirical marginalization.

Let βQ=(0,,0)

  1. for draw (l)1:L do

  2.  for subject i1:n do

  3.   for joint response q1:Q do

  4.    Compute $\psi_{iq}^{(l)}=\beta_{0q}^{(l)}+\beta_{1q}^{(l)}T_i+\beta_{2q}^{(l)}x_i+\beta_{3q}^{(l)}\,x_i\times T_i$

  5.    Compute $\phi_{iq}^{(l)}=\dfrac{\exp[\psi_{iq}^{(l)}]}{\sum_{r=1}^{Q-1}\exp[\psi_{ir}^{(l)}]+1}$

  6.    for T0:1 do

  7.      Compute $\phi_{Tq}^{(l)}=\dfrac{\sum_{i=1}^{n}I(T_i=T)\,\phi_{iq}^{(l)}}{\sum_{i=1}^{n}I(T_i=T)}$

  8.    end for

  9.   end for

  10. end for

  11. end for
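Algorithm 2 admits a similar vectorized sketch (again our own illustration, with the same (L, Q−1, 4) storage convention, and observed treatment indicators and covariate values of length n):

```python
import numpy as np

def transform_draws_marginalized(beta_draws, T_obs, x_obs):
    """Algorithm 2, vectorized: average subject-level joint response
    probabilities within each observed treatment arm (empirical
    marginalization over the covariate distribution)."""
    L, Qm1, _ = beta_draws.shape
    n = len(x_obs)
    psi = (beta_draws[:, :, [0]]
           + beta_draws[:, :, [1]] * T_obs
           + beta_draws[:, :, [2]] * x_obs
           + beta_draws[:, :, [3]] * x_obs * T_obs)        # (L, Q-1, n)
    denom = np.exp(psi).sum(axis=1, keepdims=True) + 1.0   # (L, 1, n)
    phi_i = np.concatenate([np.exp(psi),
                            np.ones((L, 1, n))], axis=1) / denom  # (L, Q, n)
    # average the subject-level probabilities within each treatment arm
    phi = np.stack([phi_i[:, :, T_obs == T].mean(axis=2) for T in (0, 1)],
                   axis=1)                                 # (L, 2, Q)
    return phi
```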

Appendix D.

Numerical evaluation with three outcome variables

In the current section, we present an evaluation of the BMLR framework with three dependent variables.

D.1 Setup

The evaluation largely follows the setup of the simulation with two dependent variables (Section 4). Only aspects that differ from that simulation are discussed here.

D.1.1 Analysis

We performed a Bayesian trivariate logistic regression analysis and compared it to a multivariate Bernoulli procedure.

D.1.2 Effect size

We present the results of the Bayesian trivariate logistic regression analysis for a selection of effect sizes, namely 1.1 and 3.1. Crossing each of these effect sizes with the three correlation structures (ρ<0, ρ≈0, and ρ>0) resulted in the six data-generating mechanisms presented in .

D.1.3 Sample size

Similar to the numerical evaluation, we applied the All, Any, and Compensatory rules, assigning the Compensatory rule unequal weights w = (0.50, 0.25, 0.25).

The required sample sizes for three outcome variables were computed via the procedure described in Section 2.4, targeting a Type I error rate of 0.05 and a power of 0.80. The sample sizes are presented in .

D.1.4 Decision rule

We performed a right-sided (superiority) test aiming at a Type I error rate of α=0.05. We used a decision threshold p_cut = 1 − α = 0.95 for the Compensatory and All rules and a threshold corrected for multiple tests, p_cut = 0.981, for the Any rule (Marsman & Wagenmakers, Citation2016; Kavelaars et al., Citation2020; Sozu et al., Citation2016).
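As an illustration of how such thresholds are applied, the three rules can be sketched on posterior draws of treatment differences. This is a simplified sketch rather than the simulation code: the function name and the joint-probability reading of the All rule are our assumptions, and the 0.981 value is the corrected threshold reported above:

```python
import numpy as np

def superiority_decisions(delta, w=(0.50, 0.25, 0.25),
                          p_cut=0.95, p_cut_any=0.981):
    """Sketch of the All, Any, and Compensatory superiority rules applied
    to posterior draws of K treatment differences, delta of shape (L, K),
    using a right-sided test."""
    p_all = np.mean((delta > 0).all(axis=1))     # joint superiority
    p_each = np.mean(delta > 0, axis=0)          # per-outcome probabilities
    p_comp = np.mean(delta @ np.asarray(w) > 0)  # weighted combination
    return {"All": bool(p_all > p_cut),
            "Any": bool((p_each > p_cut_any).any()),
            "Compensatory": bool(p_comp > p_cut)}
```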

D.1.5 Procedure

To stabilize computations, we used 20,000 iterations for the multivariate Bernoulli model.

D.2 Results

Figure D1. Bivariate superiority and inferiority spaces for the All, Any, and Compensatory (w = (0.50, 0.50)) rules.


Figure D2. Traceplot of MCMC chains for the application of Bayesian multivariate logistic regression to the IST data.


Figure D3. Comparison of CATEs and their standard deviations per interval of blood pressure after stratified multivariate analysis (mB) and multivariate logistic regression (mLR). Each interval reflects one standard deviation.


Table D1. Example of required sample sizes (n) for analysis with correlated data, and anticipated probabilities to conclude superiority when sample size computations use the true correlation (pT) vs. when they assume uncorrelated dependent variables (pU), under four different data-generating mechanisms (DGMs).

Table D2. Parameters of average treatment effects (ATEs) in the trial and conditional average treatment effects (CATEs) in a subpopulation for two outcome variables.

Table D3. Required sample sizes to evaluate the average treatment effect (ATE) and conditional treatment effect (CATE) for two outcome variables.

Table D4. Bias in average treatment differences of effect size (ES) 4.1 and 4.2 by decision rule.

Table D5. Proportions of superiority decisions for ATEs with two outcome variables by data-generating mechanism, correlation, and decision rule.

Table D6. Proportions of superiority decisions for CATEs with two outcome variables by data-generating mechanism, correlation, and decision rule.

Table D7. Average and conditional average treatment effects (ATE and CATE respectively) and their posterior probabilities (pp) in the IST data, by interval of blood pressure (Bp). Superiority or inferiority was concluded when > or < respectively.

Table D8. Conditional average treatment effects in the IST data, by value of blood pressure (Bp). Superiority or inferiority was concluded when > or < respectively.

Table D9. Example of means of the prior distribution of regression coefficients.

Table D10. Parameters of average treatment effects (ATEs) in the trial and conditional average treatment effects (CATEs) in a subpopulation for three outcome variables.

Table D11. Required sample sizes to evaluate the average treatment effect (ATE) and conditional treatment effect (CATE) for three outcome variables.

Table D12. Proportions of superiority decisions for three outcome variables by data-generating mechanism, correlation, and decision rule.