Abstract
We demonstrate how a linear factor model with latent variables can be used to estimate correlations between the outcomes of clinical trials. These correlations are needed for many policy questions of drug/vaccine development (such as calculating the optimal size of financial incentives) and the literature so far has relied on expert opinions. We apply our methodology to the case of vaccines and show that the estimated correlations are highly significant. We also illustrate how the estimated correlations can be used to find the probability of obtaining a successful vaccine out of a certain number of candidates and to determine optimal investment in vaccine development.
Acknowledgments
We are thankful for valuable comments and feedback from Artem Goryaev, Andrey Zubarev, Kristina Nesterova, Revold Entov, Sergey Sinelnikov-Murylyov, and other participants of seminars at RANEPA.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Notes
1 For example, in an educational setting they can be used to estimate students' ‘ability’ based on test outcomes. See section 4.13 in Bartholomew et al. [Citation2] for more details.
2 Typically, we have several correlated discrete observations on a cross-section of random subjects, and we want to estimate how these observations are affected by some explanatory variables while taking into account the correlations between observations within a subject (see Song and Lee [Citation12] or Chib and Greenberg [Citation3]). Multivariate logit regressions can also be used for such estimation (for example, Li and Wong [Citation7]).
3 We can do this by matching the quantiles of the actual distribution of with the quantiles of the standard normal distribution.
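The quantile matching mentioned in note 3 can be sketched as follows. This is only an illustration under our own assumptions: the function name `to_standard_normal`, the plotting-position rule `(rank - 0.5) / n`, and the exponential test sample are hypothetical choices, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm, rankdata


def to_standard_normal(x):
    """Map a sample to standard normal scores by matching quantiles.

    Each observation is replaced by the standard normal quantile at its
    empirical cumulative probability (assumed rule: (rank - 0.5) / n).
    """
    x = np.asarray(x, dtype=float)
    ranks = rankdata(x)               # ranks 1..n, ties get average ranks
    u = (ranks - 0.5) / len(x)        # empirical probabilities in (0, 1)
    return norm.ppf(u)                # matching standard normal quantiles


# Hypothetical skewed sample, used only to exercise the function.
sample = np.random.default_rng(0).exponential(size=1000)
z = to_standard_normal(sample)
```

The transformation is monotone, so the ordering of the observations is preserved while the marginal distribution becomes approximately standard normal.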
4 The reason why one should define the correlation structure for the continuous ‘quality’ variables instead of using the outcomes is that for discrete random variables the correlation matrix does not fully describe the joint distribution, whereas for continuous variables (under the assumption of a multivariate normal distribution) it is sufficient.
5 Using the Cholesky decomposition of the covariance matrix, we can represent each element as a weighted sum of standard normal random variables. We can call these variables factors.
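The representation in note 5 can be checked numerically. The sketch below uses a hypothetical 3 x 3 correlation matrix chosen purely for illustration; it is not an estimate from the paper.

```python
import numpy as np

# Hypothetical correlation matrix (illustrative values only).
corr = np.array([
    [1.0, 0.5, 0.3],
    [0.5, 1.0, 0.4],
    [0.3, 0.4, 1.0],
])

# Lower-triangular factor: corr = chol @ chol.T
chol = np.linalg.cholesky(corr)

# Independent standard normal "factors", one row per factor.
rng = np.random.default_rng(0)
factors = rng.standard_normal((3, 200_000))

# Each variable is a weighted sum of the factors; row i of `chol`
# holds the weights for variable i.
x = chol @ factors

# The simulated variables reproduce the target correlations
# up to sampling noise.
sample_corr = np.corrcoef(x)
```

Because `chol` is lower triangular, the first variable loads on one factor only, the second on two, and so on.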
6 We use a lower case letter to distinguish a specific realization from the random variable .
7 We illustrate how our methodology can be extended to incorporate additional information in a simulation study (#2) in the Appendix. We should note, though, that adding more factors makes the estimation computationally challenging and might require significant time or computing resources.
8 The code for the estimation is available at https://github.com/sergeynzhuk/clinical-trials-correlations. The numerical integrals were calculated with the ‘scipy.integrate’ Python library; for the optimization we used ‘scipy.optimize’ with the default BFGS algorithm. The estimation running times are for a machine with Intel Core i5-6500 CPU @ 3.20GHz, 8 GB RAM, Windows 10, Python 3.7.9.
9 Here we make an important assumption that the distributional parameters of the candidates do not change as we add more candidates. In reality, more promising projects are financed first. Thus, as we add more candidates, their success probabilities are likely to decrease. In addition, the candidates are likely to be similar to already developed ones. Thus, the correlations could increase as well.
10 For simplicity, we assume that we need only one successful vaccine. If additional successes bring additional benefits, the example can be straightforwardly modified to account for them as well.
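Under the assumptions of notes 9 and 10 (identical distributional parameters for all candidates, and only one success needed), the probability of at least one success can be sketched with a one-factor model. The specification below is our own simplifying assumption rather than the paper's estimated model: every candidate has the same marginal success probability `p`, any two quality variables have correlation `rho`, and a candidate succeeds when its standard normal quality exceeds a threshold. The function name `prob_at_least_one` is hypothetical.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm


def prob_at_least_one(n, p, rho):
    """P(at least one of n candidates succeeds) in a one-factor model.

    Assumed model: quality_i = sqrt(rho) * f + sqrt(1 - rho) * e_i with
    f, e_i independent standard normal; success if quality_i > c, where
    c is set so that the marginal success probability equals p.
    Requires 0 <= rho < 1.
    """
    c = norm.ppf(1.0 - p)

    def integrand(f):
        # Success probability of one candidate given the common factor f.
        p_f = norm.cdf((np.sqrt(rho) * f - c) / np.sqrt(1.0 - rho))
        # Given f, candidates are independent, so all n fail together
        # with probability (1 - p_f) ** n.
        return (1.0 - p_f) ** n * norm.pdf(f)

    all_fail, _ = quad(integrand, -np.inf, np.inf)
    return 1.0 - all_fail


independent = prob_at_least_one(5, 0.3, 0.0)   # equals 1 - 0.7 ** 5
correlated = prob_at_least_one(5, 0.3, 0.5)    # lower than `independent`
```

With positive correlation the candidates tend to fail together, so additional candidates raise the probability of at least one success more slowly than under independence.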
11 The duration T is the combined duration of three phases of clinical trials (from Equation (39)). The average development cost I roughly corresponds to the expected present value of the investments required for each phase (see Equation (38)).
12 For an arbitrary structure of dependence among the variables, one might be inclined to use, e.g., copulas. In our case, though, there are too few observations to test the implications of relaxing the factor model assumption.
13 The code for the estimation is available at https://github.com/sergeynzhuk/clinical-trials-correlations. The numerical integrals were calculated with the ‘scipy.integrate’ Python library; for the optimization we used ‘scipy.optimize’ with the default BFGS algorithm. When calculating the likelihood for model (d), we first constructed a function that calculates the likelihood for each disease for a given realization of the common factor and then calculated the average value of this function over the common factor with numerical integration. The estimation running times (in the format hh:mm:ss) are for a machine with Intel Core i5-6500 CPU @ 3.20GHz, 8 GB RAM, Windows 10, Python 3.7.9.
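The two-step likelihood calculation described in note 13 can be sketched roughly as follows. This is not the code from the repository: the one-factor specification, the logistic reparameterization, the integration range, the helper names (`group_likelihood`, `unpack`, `neg_log_likelihood`), and the toy data are all assumptions made for illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize
from scipy.stats import norm


def group_likelihood(successes, trials, p, rho):
    """Likelihood of one group's outcomes with the common factor integrated out.

    Assumed model: given the factor f, trials in a group succeed
    independently with probability Phi((sqrt(rho) * f - c) / sqrt(1 - rho)).
    """
    c = norm.ppf(1.0 - p)

    def integrand(f):
        p_f = norm.cdf((np.sqrt(rho) * f - c) / np.sqrt(1.0 - rho))
        # Likelihood given f, weighted by the standard normal density of f.
        return p_f ** successes * (1.0 - p_f) ** (trials - successes) * norm.pdf(f)

    value, _ = quad(integrand, -8.0, 8.0)  # tails beyond +-8 are negligible
    return value


def unpack(theta):
    """Map unconstrained parameters to (p, rho), kept strictly inside (0, 1)."""
    p, rho = 1e-6 + (1.0 - 2e-6) / (1.0 + np.exp(-np.asarray(theta, dtype=float)))
    return p, rho


def neg_log_likelihood(theta, data):
    p, rho = unpack(theta)
    return -sum(np.log(max(group_likelihood(s, n, p, rho), 1e-300)) for s, n in data)


# Made-up (successes, trials) pairs per group; NOT data from the paper.
data = [(0, 4), (3, 5), (1, 6), (0, 3), (4, 4), (2, 7), (0, 5), (1, 2)]

result = minimize(neg_log_likelihood, x0=np.zeros(2), args=(data,), method="BFGS")
p_hat, rho_hat = unpack(result.x)
```

The logistic reparameterization is one way to let an unconstrained optimizer such as BFGS search over parameters that must stay between 0 and 1; other transformations would serve equally well.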
14 Again, we make an assumption that the distributional parameters of the candidates do not change as we add more candidates.
15 The costs of the phases are from DiMasi et al. [Citation4]. Their estimates are based on a confidential survey of manufacturers for 68 randomly selected compounds, including one vaccine. Waye et al. [Citation15] point out that the development costs of vaccines can differ from those of drugs, and Gouglas et al. [Citation5] provide estimates of the phase I and phase II costs for vaccines specifically. However, we use the numbers from DiMasi et al. [Citation4], since Gouglas et al. [Citation5] do not provide phase III data. The durations of the phases are from Pronker et al. [Citation10].