Article Commentary

The Big Data Paradox in Clinical Practice

Pages 567-576 | Received 11 May 2022, Accepted 27 May 2022, Published online: 08 Jun 2022

Abstract

The big data paradox is a real-world phenomenon whereby as the number of patients enrolled in a study increases, the probability that the confidence intervals from that study will include the truth decreases. This occurs in both observational and experimental studies, including randomized clinical trials, and should always be considered when clinicians are interpreting research data. Furthermore, as data quantity continues to increase in today’s era of big data, the paradox is becoming more pernicious. Herein, I consider three mechanisms that underlie this paradox, as well as three potential strategies to mitigate it: (1) improving data quality; (2) anticipating and modeling patient heterogeneity; (3) including the systematic error, not just the variance, in the estimation of error intervals.

Introduction

Counterintuitively, as the size of a study increases, the probability that its confidence intervals will include the truth decreases (Citation1). Thus, studies with large sample sizes can yield misleadingly narrow confidence intervals around heavily biased results. Because large quantities of data are now available to researchers, this phenomenon has been labeled the “big data paradox,” and it was a major culprit behind the unreliable predictions made from the 2016 United States general election polling datasets (Citation2), Google Flu Trends (Citation3), and Delphi–Facebook, the largest public health survey conducted in the United States to date (Citation4). All study types, including randomized controlled trials (RCTs), are susceptible to this paradox (Citation5). We therefore need to understand why it occurs and how best to address it when interpreting research studies.

The map is not the territory

A key concept behind the big data paradox was introduced in the early twentieth century by the Polish American scientist and philosopher Alfred Korzybski who noted that “the map is not the territory” (Citation6). This means that all abstract models (“maps”) derived from an objective truth (the “territory”) are not the truth itself. We should thus not confuse our models of reality with reality itself. In theory, one may counterargue that as our models become progressively more elaborate, we will one day perfectly represent reality. The Argentinian writer Jorge Luis Borges explained in his one-paragraph short story “On Exactitude in Science” why our models can never perfectly encapsulate the truth (Citation7). He described an Empire whose cartographers were able to make a perfect map that was an exact duplicate in size and complexity of the Empire itself. The map was too cumbersome for any practical use and was immediately abandoned. This story illustrates why our models cannot, and should not, perfectly mirror the underlying reality. The perfect map reproducing precisely every detail of the territory would be as large and complex as the territory itself, down to every atom, which is physically impossible. But even if we could create such flawlessly complete maps, the purpose of maps is not to be perfectly precise but rather to help users navigate a terrain. For example, a map that is small enough to be portable will be more useful to a traveler than a perfectly scaled map of the same size as the terrain. Modern technology does allow users to carry highly detailed digital maps capable of quickly changing scales. However, depending on the task at hand, users will change the scale, resolution, and representations provided by the digital map to focus on specific points of interest such as major road arteries or COVID-19 prevalence. Similarly, scientists devise models of the world that accurately describe a system at a given resolution, but then revise these models to describe properties existing at different resolutions (Citation8,Citation9). For example, tobacco smoking should be included as a cause of lung cancer at the population level but can be omitted at the resolution of a single individual patient with lung cancer who has never been exposed to tobacco (Citation10).

The map-terrain analogy parallels the way we tailor statistical models to represent the underlying reality based on the task at hand. For explanatory tasks, statistical models are specifically designed to test causal hypotheses when applied to data (Citation11,Citation12). Such models are developed based on our theories about the aspects of the underlying reality (the terrain) that we are interested in explaining. For example, the regression model used for the analysis of an RCT dataset may include an interaction term to test the hypothesis that a biomarker influences the treatment effect at the hazard ratio scale. This was the approach used to identify a treatment interaction signal between the degree of HER2 overexpression and trastuzumab efficacy in patients with HER2-positive breast cancer (Citation13). For predictive tasks, we are instead interested in mapping certain inputs, such as the prostate-specific antigen level of a patient, to specific outputs, such as the probability that the patient has prostate cancer. Statistical models used for prediction can range from elementary calculations to supervised learning algorithms, such as neural networks, that do not need to provide directly interpretable information on the underlying relationship (the terrain) between the inputs and the outputs (Citation11,Citation12,Citation14).

Borges’s story also powerfully illustrates why our models should balance complexity with simplicity (also known as parsimony). This balancing act was summarized in G. E. P. Box’s famous quote “All models are wrong, but some are useful,” which dictates that our statistical models should contain sufficient complexity to properly represent the underlying reality, and at the same time be parsimonious enough to be useful for the task at hand (Citation15). The models will always be wrong because there will always exist some distance between the underlying truth beyond our senses and our model of the truth. This distance between the truth and our model of the truth that occurs due to constraints and other assumptions imposed by our statistical model is called “bias” or “systematic error.”

Bias-variance trade-off

Patients analyzed by highly parsimonious statistical models are considered interchangeable with each other. Thus, large numbers of patients can be included, resulting in low variance (low “random error”) of the estimates produced by such models. Conversely, more complex models will have lower bias (lower “systematic error”) but also higher variance because the effective sample size is reduced (Citation16). For example, an RCT may enroll an ethnically and racially diverse cohort of men and women to determine the relative treatment effect at the hazard ratio (HR) scale of one treatment versus another (Figure 1). A parsimonious model will estimate one HR with lower variance, because it uses the entire cohort, and higher bias, because it does not account for how the HR varies by sex or race/ethnicity (Figure 1(A)). Conversely, a more complex model that estimates a different HR depending on each patient’s sex and race/ethnicity will have lower bias but higher variance because each estimated HR will have a lower effective sample size available (Figure 1(B)). The optimal level of complexity of a statistical model balances this trade-off between bias and variance. In clinical practice, the bias-variance trade-off can be reframed as the patient relevance-robustness trade-off, which recasts bias and variance into terms that are intuitively applicable to patient-centered inferences (Citation17). Lower bias in our statistical models results in higher patient relevance, whereas lower variance yields higher robustness of our inferences. Conceptually, variance exists only because statistical models fail to include biases that are present at different resolution levels. In this sense, the concept of bias may be more fundamental than pure variance, which can only exist at the infinite resolution level (Citation17).
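
To make the trade-off concrete, the following minimal simulation sketch (not taken from the article; the subgroup hazard ratios, baseline hazard, and sample sizes are all hypothetical) compares a single pooled log HR against subgroup-specific log HRs in a two-subgroup trial with exponential, uncensored survival times. The pooled estimate has the smaller spread (lower variance) but lies between the two subgroup truths (higher bias), whereas the subgroup-specific estimates are approximately unbiased but noisier.

```python
import numpy as np

rng = np.random.default_rng(0)

def exponential_times(n, hazard):
    """Simulate n exponential survival times with the given hazard (no censoring)."""
    return rng.exponential(1.0 / hazard, n)

# Hypothetical subgroup-specific true hazard ratios (treatment vs. control).
true_hrs = {"subgroup_A": 0.5, "subgroup_B": 1.0}
baseline_hazard, n_per_arm, n_sims = 0.1, 100, 2000

pooled, by_group = [], {g: [] for g in true_hrs}
for _ in range(n_sims):
    control, treatment = {}, {}
    for g, hr in true_hrs.items():
        control[g] = exponential_times(n_per_arm, baseline_hazard)
        treatment[g] = exponential_times(n_per_arm, baseline_hazard * hr)
        # Complex model: one log HR per subgroup (hazard estimated as events / total time).
        by_group[g].append(np.log((n_per_arm / treatment[g].sum()) /
                                  (n_per_arm / control[g].sum())))
    # Parsimonious model: a single log HR pooling all patients across subgroups.
    t_all = np.concatenate(list(treatment.values()))
    c_all = np.concatenate(list(control.values()))
    pooled.append(np.log((t_all.size / t_all.sum()) / (c_all.size / c_all.sum())))

print(f"Pooled log HR:      mean = {np.mean(pooled):+.2f}, SD = {np.std(pooled):.2f}")
for g, hr in true_hrs.items():
    print(f"{g} log HR: mean = {np.mean(by_group[g]):+.2f} "
          f"(truth {np.log(hr):+.2f}), SD = {np.std(by_group[g]):.2f}")
```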

Figure 1. The bias-variance trade-off in randomized controlled trials (RCTs). The trial and model are designed based on prior knowledge, which informs the choice of the statistical test to be used, the variables to be included in the model, and whether to focus on parsimony or to account for anticipated patient heterogeneity. For Bayesian models, prior probabilities are also needed. Once the RCT is activated, it may enroll a diverse patient population. The final data may then be analyzed using either a parsimonious (A) or a more complex (B) statistical model that accounts for patient heterogeneity. (A) In the first scenario, all of the enrolled patients are used to estimate one parameter for the relative treatment effect measured at the HR scale. The estimated HR will have higher bias (higher systematic error) because it assumes the heterogeneous enrolled patient population to be homogeneous, but lower variance (lower random error) because it uses every enrolled patient. (B) In the second scenario, the more complex model generates multiple HRs that take into account the diverse subgroups of patients enrolled, thus resulting in lower bias (lower systematic error). However, each estimated HR is based on a lower effective sample size, resulting in higher variance (higher random error). Figure adapted from images created with BioRender.com.

Table 1. Glossary of key terms.

There are clear parallels between the bias-variance trade-off and the balance between the complexity and usefulness of a map representing a terrain, described above. This is because low variance means that our measurements are closer to each other, thus increasing the ability of an estimated parameter, such as an HR, to reliably predict the future outcomes of patients treated with one therapy versus another. Therefore, estimated HRs with lower variance will be more useful for clinical practice than those with high variance. The relationship between the variance and bias of an estimated parameter, such as an HR, can be expressed by the equation:

(1) Mean squared error = (bias)² + variance

The square root of the mean squared error is known as the “reducible error” or the “root-mean-square error” (RMSE) of the estimated parameter, whereas the square root of the variance is known as the “standard error.” The confidence intervals (CIs) for estimated parameters such as the HR are functions of the standard error. More specifically, the interval defined by the estimate ± 1 standard error corresponds to the 68.27% CI, whereas the commonly used 95% CI corresponds to the estimate ± 1.96 × standard error. Therefore, as the standard error decreases, the CIs around an estimated parameter will become narrower (Citation18,Citation19). Equation (1) can be rewritten as a Pythagorean sum-of-squares relation:

(2) (RMSE)² = (systematic error)² + (standard error)²
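
For readers who prefer to see the arithmetic, the short sketch below plugs purely hypothetical values for the bias and standard error of an estimated log HR into Equations (1) and (2) and into the 95% CI relation described above.

```python
import math

# Hypothetical systematic error (bias) and standard error on the log HR scale.
bias = 0.15
standard_error = 0.10

mse = bias**2 + standard_error**2   # Equation (1): mean squared error
rmse = math.sqrt(mse)               # Equation (2): reducible error (RMSE)

estimate = math.log(0.80)           # hypothetical estimated HR of 0.80
ci_95 = (estimate - 1.96 * standard_error, estimate + 1.96 * standard_error)

print(f"RMSE (reducible error) = {rmse:.3f}")
print(f"95% CI on the HR scale = ({math.exp(ci_95[0]):.2f}, {math.exp(ci_95[1]):.2f})")
```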

The RMSE (reducible error), systematic error (bias), and standard error used in Equation (2) are all expressed in the same units as the parameter of interest. They can therefore be used to generate intuitive error intervals for the reducible error that represent our uncertainty regarding the true value of the parameter of interest more accurately than traditional CIs. A key task during data analysis is to determine the optimal statistical modeling strategy that minimizes the reducible error by decreasing the standard error without strongly increasing the systematic error (Citation20,Citation21). Furthermore, other metrics, such as the mean absolute error, may be used instead of the RMSE to represent the reducible error (Citation19,Citation20,Citation22). The advantage of the RMSE is that it treats larger deviations from the truth as worse than smaller deviations, and it is easy to interpret because it uses the same unit scale as the parameter of interest.
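
As a small illustration of this advantage (with hypothetical residuals), the two sets of errors below have the same mean absolute error but different RMSEs, because the RMSE penalizes the single large deviation more heavily.

```python
import numpy as np

residuals_even  = np.array([0.2, 0.2, 0.2, 0.2])   # errors spread evenly
residuals_spiky = np.array([0.0, 0.0, 0.0, 0.8])   # one large deviation

for name, r in [("even", residuals_even), ("spiky", residuals_spiky)]:
    mae = np.mean(np.abs(r))
    rmse = np.sqrt(np.mean(r**2))
    print(f"{name:5s}: MAE = {mae:.2f}, RMSE = {rmse:.2f}")
# Both sets have MAE = 0.20, but the spiky set has RMSE = 0.40 versus 0.20.
```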

In addition to the reducible error that can be estimated from the data using the RMSE metric, there exists an “irreducible error” in measuring the parameter of interest that is caused by unknown elements not represented within the dataset (Citation20). For example, an unmeasured biomarker may influence the HR in some patient subsets, or unidentified data quality issues, such as poor data curation and integration across multiple treatment centers, may increase the noise in our measurements. Thus, the total error in our estimate of a parameter will be the sum of both the reducible and the irreducible errors:

(3) Total error = reducible error + irreducible error

The big data paradox

All else being equal, the standard error will decrease and the CIs will narrow as the sample size of a study increases (Citation2,Citation5,Citation18,Citation19,Citation23). The big data paradox can emerge because, as the CIs become narrower, the probability that they will include the truth decreases (Figure 2). The wide CIs of a small RCT may include the true value of the parameter of interest and properly reflect the uncertainty in its estimation. The practical disadvantage of small studies with wide CIs is that the parameter values included within these intervals may be widely divergent and thus not useful in clinical practice. For example, the data from a small RCT that yielded an HR of 0.72 with a 95% CI of 0.23–2.20 are statistically compatible at the 0.95 level with both a very strong effect size (HR = 0.23) favoring the new treatment over the control and a very strong effect size (HR = 2.20) favoring the control over the new treatment. These inconclusive results cannot inform clinical practice even if the true parameter value is included within these CIs.
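
For illustration, the precision implied by such a result can be reconstructed from the reported interval. The sketch below assumes the standard rule of thumb that, in a 1:1 randomized trial, the standard error of the log HR is approximately 2 divided by the square root of the number of events; the trial numbers themselves are the hypothetical ones quoted above.

```python
import math

hr, ci_low, ci_high = 0.72, 0.23, 2.20
se_log_hr = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
approx_events = 4 / se_log_hr**2        # from SE(log HR) ≈ 2 / sqrt(events)

print(f"Standard error of the log HR ≈ {se_log_hr:.2f}")      # ≈ 0.58
print(f"Implied number of events     ≈ {approx_events:.0f}")  # roughly a dozen
# Every HR between 0.23 and 2.20 is statistically compatible with these data at
# the 0.95 level, which is far too wide a range to guide treatment decisions.
```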

Figure 2. The big data paradox. (A) Our target is the underlying true parameter, but its complexity can never be perfectly modeled. (B) Therefore, there will always be some distance between our models of the true parameter and the true parameter itself. This distance is the sum of the bias (blue line) due to constraints and other assumptions imposed by the statistical model, and of the irreducible error (brown line) due to unknown elements that are not represented within the dataset. (C) When the sample size of a study is small, then the variance is large, yielding wide confidence intervals around the estimated parameter that are more likely to include the true parameter value. (D) If the same study has a larger sample size, then the reduced variance will yield narrower confidence intervals that are less likely to include the true parameter.

Figure 3. Potential causes of the big data paradox in an RCT. (A) When the sample size of the RCT is small, the corresponding wide CIs may include the true parameter of interest, e.g. the true HR for the relative treatment effect of one treatment versus another. (B) As the sample size of the RCT increases, the quality of the gathered data may decrease, resulting in increased irreducible error within the dataset even if the enrolled patient cohort was highly homogeneous and the statistical model bias remains the same. This can be addressed by maintaining high data quality throughout the study. (C) As the sample size of the RCT increases, the heterogeneity of the patient cohort may also increase, resulting in both increased bias and increased irreducible error, even if the data quality remains consistently high throughout the study. This can be addressed by both anticipating and modeling patient heterogeneity in the estimation of the HR. (D) As the sample size of the RCT increases, the reducible error of the estimated parameter is dominated by the systematic error (bias) and is not properly represented by CIs that are a function of the standard error (variance). This occurs even if the enrolled patient cohort is highly homogeneous and data quality remains consistently high throughout the study. It can be addressed via the use of error intervals that represent the reducible error due to both systematic and standard error.

There is therefore good rationale for increasing the sample size and thereby reducing the variance of RCT estimates. Large studies, however, are susceptible to the big data paradox. In real-world applications, such as RCTs, there are three major, not mutually exclusive, mechanisms behind this paradox (Figure 3). The first is the decreased data quality that can occur as more patients enroll in a study (Figure 3(B)). Multicenter RCTs conducted across diverse institutions, regions, and continents for long time periods are particularly susceptible to this mechanism. It can be addressed by placing more emphasis on data quality rather than focusing solely on data quantity (Citation2,Citation4,Citation24). Data quality is also important in the analysis of large real-world prospective or retrospective observational studies, which may be more prone than RCTs to errors in data collection (Citation25). Data quality can be improved by explicitly providing more details on the origins of the data used in the analysis, including the steps involved in data selection, pre-processing, curation, and provenance (Citation24). In his recent overview on enhancing data quality, Meng (Citation24) termed this quality inspection process “data minding,” and its transparent reporting “data confession.” For example, it can be disclosed that a baseline variable obtained from a complete blood count, such as the neutrophil-to-lymphocyte ratio, was used as a proxy for more direct measurements of systemic inflammatory markers because it is an easily accessible and affordable metric that can easily be adopted in clinical practice (Citation26,Citation27).

The second mechanism through which the big data paradox can occur is increasing patient heterogeneity as more patients are enrolled in a trial (Figure 3(C)). The sources of this heterogeneity can be either measured within the dataset or unmeasured. Measured patient heterogeneity increases bias because the parameter of interest may vary depending on the diverse attributes of each patient. In addition, heterogeneous patient cohorts are more likely to harbor unmeasured attributes, such as unknown biomarkers, that can influence parameter estimation, thus increasing the irreducible error. The measured and unmeasured effects of increasing patient heterogeneity can be mitigated by carefully anticipating and measuring sources of potential irreducible error prior to and during data collection, and by increasing the flexibility of the statistical model to account for measured patient heterogeneity, thus reducing model bias (Citation23,Citation24,Citation28–30). The shrinkage yielded by multilevel modeling, such as Empirical Bayes techniques, typically provides the optimal balance between bias and variance by reducing the RMSE more than other modeling approaches (Citation17,Citation19,Citation31). Thus, multilevel (also known as “hierarchical”) modeling is a reliable strategy to account for the measured patient heterogeneity introduced by increasing sample sizes (Citation32,Citation33). This is because multilevel models provide a coherent framework to search for signals in the data while maximizing the use of information across multiple resolutions (levels), thus minimizing artifacts from random noise (Citation34). Multilevel models contain multiple levels within a nested hierarchy, allowing the borrowing of information across the various subsets (Citation35). For example, a prognostic risk model for renal cell carcinoma developed using patients from medical centers in the United States may be a subset of a larger multilevel model that contains patients from medical centers in Canada, Europe, and Asia. These subsets may themselves contain groups defined by country, or deeper levels such as treatment center. The full model (superset) can borrow information across subsets to distinguish contrasts and commonalities, thus providing superior estimates of the prognostic effects of each variable of interest. In the clinical trial setting, the ideal scenario would be to conduct small RCTs tailored to specific attributes of a very homogeneous patient cohort. Each of these trials would be nested within a multilevel model that allows the borrowing of information across trials and patient subsets (Citation32).
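
As a minimal sketch of this shrinkage idea, the example below partially pools hypothetical center-specific log HR estimates toward the overall mean using a simple normal-normal empirical Bayes model with a method-of-moments estimate of the between-center variance; it is meant only to illustrate the borrowing of information across subsets, not to reproduce any specific published model.

```python
import numpy as np

# Hypothetical center-specific log HR estimates and their standard errors.
estimates  = np.array([-0.80, -0.10, -0.55, 0.30, -0.40])
std_errors = np.array([ 0.30,  0.15,  0.25, 0.35,  0.20])

# Precision-weighted overall mean across centers.
grand_mean = np.average(estimates, weights=1 / std_errors**2)

# Method-of-moments estimate of the between-center variance, floored at zero.
tau2 = max(np.mean((estimates - grand_mean)**2 - std_errors**2), 0.0)

# Shrinkage factor: noisier centers are pulled more strongly toward the mean.
shrinkage = tau2 / (tau2 + std_errors**2)
partially_pooled = grand_mean + shrinkage * (estimates - grand_mean)

for raw, eb in zip(estimates, partially_pooled):
    print(f"raw log HR = {raw:+.2f} -> partially pooled estimate = {eb:+.2f}")
```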

Careful anticipation and statistical modeling of patient heterogeneity and other sources of structural causal bias introduced by increasing sample sizes are also crucial during the analysis of large, real-world observational datasets (Citation19,Citation36–38). Structural causal biases are sources of systematic (measured) and irreducible (unmeasured) errors that can be represented by causal diagrams (Citation36,Citation37,Citation39–43). A prior commentary detailed how causal diagrams can help interrogate structural causal biases in a hypothetical RCT testing the efficacy of erlotinib, an epidermal growth factor receptor (EGFR) inhibitor, compared with control chemotherapy in patients with non-small cell lung cancer (NSCLC) (Citation42). Only NSCLC tumors that harbor specific oncogenic EGFR gene mutations are driven by the EGFR pathway and are thus affected by erlotinib (Citation44). If the RCT does not collect any information on the EGFR mutation status of the NSCLC tumors of each patient treated on the trial, then the treatment effect heterogeneity introduced by such differences in EGFR mutation status between patients will be perceived as irreducible noise. Conversely, if the EGFR mutation status is determined for each patient, then the final dataset can be used to properly model the effect of this biomarker on the RCT outcome. In some scenarios, however, strong sources of heterogeneity will remain unknown and thus unmeasured. For this reason, it is prudent to carefully collect correlative blood and tissue samples from patients enrolled on RCTs and other prospective clinical studies. These biospecimens can then be used for fundamental basic and translational research to discover previously unknown sources of treatment effect heterogeneity and accordingly guide the rational design of subsequent clinical trials (Citation45).
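
A toy simulation (with entirely hypothetical effect sizes and a hypothetical 50% mutation prevalence) illustrates the contrast described above: when EGFR status is unmeasured, the treatment-by-biomarker heterogeneity is absorbed into the noise around a single pooled effect, whereas measuring it lets an interaction model recover the subgroup-specific effects.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000
treated = rng.integers(0, 2, n)      # 1:1 randomization
egfr_mut = rng.integers(0, 2, n)     # hypothetical 50% EGFR mutation prevalence
# Hypothetical continuous outcome (e.g., change in tumor burden, lower is better):
# the treatment helps only biomarker-positive patients.
outcome = -1.0 * treated * egfr_mut + rng.normal(0, 1, n)

# Model ignoring the biomarker: one pooled treatment effect.
X_pooled = np.column_stack([np.ones(n), treated])
beta_pooled, *_ = np.linalg.lstsq(X_pooled, outcome, rcond=None)

# Model including the measured biomarker and its interaction with treatment.
X_full = np.column_stack([np.ones(n), treated, egfr_mut, treated * egfr_mut])
beta_full, *_ = np.linalg.lstsq(X_full, outcome, rcond=None)

print(f"Pooled treatment effect      ≈ {beta_pooled[1]:+.2f} (averages ~ -0.5)")
print(f"Effect in EGFR wild-type     ≈ {beta_full[1]:+.2f} (truth  0.0)")
print(f"Additional effect if mutated ≈ {beta_full[3]:+.2f} (truth -1.0)")
```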

The last mechanism of the big data paradox occurs even when the additional patients enrolled do not increase the bias or the irreducible error of the RCT (Figure 3(D)). As the sample size of the RCT increases, the standard error shrinks, and the total error of the estimated parameter is mainly due to irreducible error and systematic error (bias). Because CIs are a function of the standard error and not of the total error, they will become misleadingly narrow, instilling a false sense of confidence regarding the accuracy of the estimated parameter. It has accordingly been suggested to replace the term “confidence” in CIs with the more modest label “compatibility intervals” (Citation46). Furthermore, the big data paradox that occurs via this mechanism can be addressed by replacing the traditional calculation of CIs with error intervals that represent the reducible error due to both systematic and standard error. The current approach of focusing only on the standard error implies that this error is more costly than systematic error. However, high systematic error can lead to perniciously false inferences and unreliable predictions even when the standard error is low (Citation2–4). Error intervals that are based on reducible error metrics, such as the RMSE, more properly consider systematic and standard errors to be similarly costly. For example, prediction intervals, which can be defined as the intervals between ±2 × RMSE (Citation47,Citation48), incorporate both systematic and random error. Thus, whereas confidence intervals narrow in inverse proportion to the square root of the effective sample size, the width of the prediction intervals will remain virtually unaffected (Citation49). But even after accounting for all reducible error in the data, we should maintain epistemic humility because there will always exist some degree of irreducible error that cannot be quantified and accurately represented by error intervals.
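
A brief numeric sketch (with a hypothetical, fixed systematic error and hypothetical per-patient noise) shows the contrast: the 95% CI half-width shrinks toward zero as the sample size grows, whereas an error interval of ±2 × RMSE levels off at the systematic error and continues to reflect the remaining reducible error.

```python
import math

bias = 0.20    # hypothetical systematic error, unaffected by sample size
sigma = 1.00   # hypothetical per-patient standard deviation

for n in (100, 1_000, 10_000, 100_000):
    standard_error = sigma / math.sqrt(n)
    rmse = math.sqrt(bias**2 + standard_error**2)
    print(f"n = {n:>7}: 95% CI half-width = {1.96 * standard_error:.3f}, "
          f"±2×RMSE half-width = {2 * rmse:.3f}")
```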

Conclusions

In summary, the big data paradox is becoming an increasingly common challenge due to the larger scale of contemporary research. Experimental studies such as RCTs are not immune to this paradox, and efforts should be made to mitigate it via three different strategies: (1) greater focus on data quality; (2) more flexible statistical modeling that anticipates and accounts for the increased heterogeneity of larger patient cohorts; (3) replacement of CIs with intervals that represent both the systematic and standard errors.

Acknowledgements

The author thanks Drs Bora Lim (Associate Professor, Baylor College of Medicine, Houston, TX, USA) and Christopher Logothetis (Professor, The University of Texas MD Anderson Cancer Center, Houston, TX, USA) for helpful conversations, as well as Sarah Townsend (Senior Technical Writer, Department of Genitourinary Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA) for editorial assistance.

Disclosure statement

Pavlos Msaouel reports honoraria for scientific advisory boards membership for Mirati Therapeutics, Bristol Myers Squibb, and Exelixis; consulting fees from Axiom Healthcare; non-branded educational programs supported by Exelixis and Pfizer; leadership or fiduciary roles as a Medical Steering Committee member for the Kidney Cancer Association and a Kidney Cancer Scientific Advisory Board member for KCCure; and research funding from Takeda, Bristol Myers Squibb, Mirati Therapeutics, and Gateway for Cancer Research.

Additional information

Funding

Pavlos Msaouel is supported by a Career Development Award from the American Society of Clinical Oncology, a Research Award from KCCure, the MD Anderson Khalifa Scholar Award, the MD Anderson Physician-Scientist Award, philanthropic donations by Mike and Mary Allen, and the Andrew Sabin Family Foundation Fellowship.

References