Commentary

COVID-19 and the epistemology of epidemiological models at the dawn of AI

Pages 506-513 | Received 14 Jul 2020, Accepted 15 Oct 2020, Published online: 23 Nov 2020

Summary

The models used to estimate disease transmission, susceptibility and severity determine what epidemiology can (and cannot) tell us about COVID-19. These include: ‘model organisms’ chosen for their phylogenetic/aetiological similarities; multivariable statistical models to estimate the strength/direction of (potentially causal) relationships between variables (through ‘causal inference’), and the (past/future) value of unmeasured variables (through ‘classification/prediction’); and a range of modelling techniques to predict beyond the available data (through ‘extrapolation’), compare different hypothetical scenarios (through ‘simulation’), and estimate key features of dynamic processes (through ‘projection’). Each of these models: addresses different questions using different techniques; involves assumptions that require careful assessment; and is vulnerable to generic and specific biases that can undermine the validity and interpretation of its findings. It is therefore necessary to ensure that the models used can actually address the questions posed, and have been competently applied. In this regard, it is important to stress that extrapolation, simulation and projection cannot offer accurate predictions of future events when the underlying mechanisms (and the contexts involved) are poorly understood and subject to change. Given the importance of understanding such mechanisms/contexts, and the limited opportunity for experimentation during outbreaks of novel diseases, the use of multivariable statistical models to estimate the strength/direction of potentially causal relationships between two variables (and the biases incurred through their misapplication/misinterpretation) warrants particular attention. Such models must be carefully designed to address: ‘selection-collider bias’, ‘unadjusted confounding bias’ and ‘inferential mediator adjustment bias’ – all of which can introduce effects capable of enhancing, masking or reversing the estimated (true) causal relationship between the two variables examined.1 Selection-collider bias occurs when these two variables independently cause a third (the ‘collider’), and when this collider determines/reflects the basis for selection in the analysis. It is likely to affect all incompletely representative samples, although its effects will be most pronounced wherever selection is constrained (e.g. analyses focusing on infected/hospitalised individuals). Unadjusted confounding bias disrupts the estimated (true) causal relationship between two variables when: these share one (or more) common cause(s); and when the effects of these causes have not been adjusted for in the analyses (e.g. whenever confounders are unknown/unmeasured). Inferentially similar biases can occur when: one (or more) variable(s) (or ‘mediators’) fall on the causal path between the two variables examined (i.e. when such mediators are caused by one of the variables and are causes of the other); and when these mediators are adjusted for in the analysis. Such adjustment is commonplace when: mediators are mistaken for confounders; prediction models are mistakenly repurposed for causal inference; or mediator adjustment is used to estimate direct and indirect causal relationships (in a mistaken attempt at ‘mediation analysis’). These three biases are central to ongoing and unresolved epistemological tensions within epidemiology. All have substantive implications for our understanding of COVID-19, and for the future application of artificial intelligence to ‘data-driven’ modelling of similar phenomena.
Nonetheless, competently applied and carefully interpreted, multivariable statistical models may yet provide sufficient insight into mechanisms and contexts to permit more accurate projections of future disease outbreaks.
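
To make the first of these biases concrete, here is a minimal simulation in Python (with hypothetical variable names and invented numbers, not drawn from the Commentary itself) showing how two truly independent causes of hospitalisation become spuriously correlated once an analysis is restricted to hospitalised individuals – the collider:

```python
# Selection-collider bias: restricting analysis to a collider's 'selected'
# stratum induces an association between the collider's independent causes.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

severity = rng.normal(size=n)     # hypothetical cause 1
comorbidity = rng.normal(size=n)  # hypothetical cause 2, independent of cause 1

# Hospitalisation (the collider) is driven by both causes plus noise.
hospitalised = (severity + comorbidity + rng.normal(size=n)) > 1.5

print(np.corrcoef(severity, comorbidity)[0, 1])  # ~0.00: truly independent
print(np.corrcoef(severity[hospitalised],
                  comorbidity[hospitalised])[0, 1])  # clearly negative
```

Despite being generated independently, the two causes show a marked negative correlation within the hospitalised subsample – the signature of selection on a collider.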

This article refers to:
“COVID-19 and the epistemology of epidemiological models at the dawn of AI”: comment from the editors

Acknowledgements

Mark Gilthorpe kindly provided extensive feedback on draft components of this Commentary, which also benefitted enormously from discussions with colleagues in the Leeds Causal Inference Research Group (particularly Kellyn Arnold, Laurie Berrie, Marc de Kamps, Wendy Harrison, John Mbotwa and Peter Tennant) and with Thea de Wet, Bob Mattes, Martin Shakespeare, Steph Leddington, Ian Scaysbrook, Andrew Shepherd, Richard Mason, Steve Braganza and David Moss.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 These biases, and the terminology involved, may be challenging to readers who are unfamiliar with the use of causal path diagrams (such as Directed Acyclic Graphs; DAGs), which have been instrumental in identifying the different roles that variables can play in causal processes (whether as ‘exposures’, ‘outcomes’, ‘confounders’, ‘mediators’, ‘colliders’, ‘competing exposures’ or ‘consequences of the outcome’) and in revealing hitherto under-acknowledged sources of bias in analyses designed to support causal inference. For what we hope are accessible introductions to DAGs (and how [not] to use these), please see: Ellison (2020); and Tennant et al. (2019). For more technical detail on ‘collider bias’, ‘unadjusted confounding bias’ and ‘inferential mediator adjustment bias’ (and its related concern, the ‘Table 2 fallacy’), please refer to: Cook and Ranstam (2017); Munafò et al. (2018); Tennant et al. (2017); VanderWeele and Arah (2011); and Westreich and Greenland (2013).
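
As a companion sketch (again with hypothetical variables and invented coefficients), unadjusted confounding bias can be demonstrated in a few lines: a shared cause distorts the naive exposure-outcome estimate, while adjusting for it recovers the true (here, null) effect:

```python
# Unadjusted confounding bias: a common cause of exposure and outcome
# biases the naive estimate; adjustment recovers the true (zero) effect.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

age = rng.normal(size=n)                   # hypothetical confounder
exposure = 0.8 * age + rng.normal(size=n)  # caused by age
outcome = 1.2 * age + rng.normal(size=n)   # caused by age, NOT by exposure

ones = np.ones(n)
naive = np.linalg.lstsq(np.column_stack([ones, exposure]),
                        outcome, rcond=None)[0][1]     # ~0.59 (biased)
adjusted = np.linalg.lstsq(np.column_stack([ones, exposure, age]),
                           outcome, rcond=None)[0][1]  # ~0.00 (unbiased)
print(naive, adjusted)
```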

2 ‘Model organisms’ are also those selected or developed for investigation/experimentation under controlled (often laboratory-based) conditions.

3 Such ‘predictions’ include the estimation (or classification) of unknown, unmeasured or poorly measured/specified variables either retrospectively (or, at best, in near real time) or prospectively (in the future), based on the information available from other known/measured covariates (so-called ‘predictors’). Both use statistical models (or ‘algorithms’) that have been ‘trained’ on datasets in which the ‘predicted’ variables have been (accurately) measured/specified. While the former better reflects ‘interpolative estimation/classification’ than ‘prediction’ in the literal sense, the latter generates ‘literal predictions/extrapolations’ that are nonetheless very different to the ‘predictive projections’ generated through modelling of the underlying processes theorised (or known) to be involved – projections whose accuracy and precision depend critically on robust causal knowledge (both theoretical and empirical). Widespread misunderstanding of the distinctions between these three forms of ‘prediction’ (‘interpolative estimation/classification’, ‘literal prediction/extrapolation’ and ‘predictive projection’; see Figure S1) underpins their misapplication and misinterpretation, and fuels much of the bias – and many of the errors – that pervade contemporary epidemiology and may yet undermine the application of machine learning and AI therein (Arnold et al. 2020).4,5
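
The practical difference between ‘literal prediction/extrapolation’ and ‘predictive projection’ can be sketched with invented numbers: an exponential model fitted to the early phase of a synthetic logistic outbreak tracks the observed data well in-sample, yet its out-of-sample extrapolation overshoots the true trajectory many-fold, precisely because the saturating mechanism is ignored:

```python
# Extrapolation vs the underlying mechanism: an exponential fitted to early
# logistic growth diverges badly once the process begins to saturate.
import numpy as np

t = np.arange(30)
true = 10_000 / (1 + np.exp(-0.4 * (t - 15)))  # synthetic logistic outbreak

early = t < 10  # only the first ten days are 'observed'
slope, intercept = np.polyfit(t[early], np.log(true[early]), 1)
extrapolated = np.exp(intercept + slope * t)   # literal extrapolation

print(f"{true[25]:.0f} vs {extrapolated[25]:.0f}")  # ~9,820 vs several hundred thousand
```

A projection built on the logistic mechanism itself would, by construction, respect the saturation that the fitted curve cannot anticipate.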

4 This is why epidemiological best practice should not rely on ‘spotting’ errors and biases, and should instead assume such problems are possible (if not likely), and diligently search for, root out and address these in the same way that parametricians routinely evaluate whether their data are normally distributed and homoscedastic (and thereby comply with two key assumptions of many parametric statistical models). Indeed, contemporary best practice extends the optimisation of parameterisation further by evaluating whether categorisation, transformation or interaction terms are required to maximise the (individual and joint) information that covariates provide to ‘predictive’ models; with the resulting models then subjected to repeated testing and evaluation. Similar diligence is required when selecting which variables to include (and which to exclude) from the ‘covariate adjustment sets’ required to minimise the risk of confounding while avoiding ‘inferential mediator adjustment bias’ in models that support robust causal inference. All such models benefit from careful parameterisation, as well as from a fuller understanding of the questions they can address (and those they cannot).
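
By way of illustration, a minimal sketch of the routine assumption-checking this note describes – testing a simple linear model’s residuals for normality and for constant variance across the covariate’s range (Levene’s test is used here as a crude stand-in for more formal heteroscedasticity tests; all names and data are illustrative):

```python
# Routine parametric assumption checks on a simple linear model's residuals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)   # synthetic, well-behaved data

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Normality of residuals (Shapiro-Wilk).
_, p_normal = stats.shapiro(residuals)

# Homoscedasticity: compare residual spread below vs above the median of x.
low = residuals[x < np.median(x)]
high = residuals[x >= np.median(x)]
_, p_equal_var = stats.levene(low, high)

print(f"normality p = {p_normal:.3f}; equal-variance p = {p_equal_var:.3f}")
```

Large p-values here are consistent with (though never proof of) the normality and homoscedasticity assumptions.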

5 This results from (mis)interpreting the coefficients of individual covariates within the outputs of a single (one-step) multivariable model (commonly those reported in a paper’s second table, hence the fallacy’s name) as evidence of their (independent) causal relationships with the predicted variable of interest. Instead, these coefficients represent only the residual contribution each covariate makes to the model after adjustment for both the individual and joint information available from all other included/adjusted covariates – a residual contribution that can deviate in both size and direction from any true causal effect.

6 ‘Inferential mediator adjustment bias’, which results from the inappropriate adjustment for mediators (variables falling on the causal pathway between the speculative cause/exposure and its potential consequence/outcome) in analyses intended to support causal inference, is the bias responsible for the ‘Table 2 fallacy’ (albeit, under those circumstances where the model in ‘Table 2’ was designed for ‘prediction’ and subsequently repurposed/misinterpreted as a suitable basis for causal inference).
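
Notes 5 and 6 can be illustrated together with one further simulation (hypothetical variables again): when an exposure acts on an outcome entirely through a mediator, adjusting for that mediator collapses the exposure’s coefficient towards zero – inviting the mistaken ‘Table 2’ reading that the exposure has no causal effect:

```python
# Inferential mediator adjustment bias / the 'Table 2 fallacy':
# adjusting for a mediator removes the exposure's (real) total effect.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

exposure = rng.normal(size=n)
mediator = exposure + rng.normal(size=n)  # lies on the causal path
outcome = mediator + rng.normal(size=n)   # true total effect of exposure = 1.0

ones = np.ones(n)
total = np.linalg.lstsq(np.column_stack([ones, exposure]),
                        outcome, rcond=None)[0][1]               # ~1.0
adjusted = np.linalg.lstsq(np.column_stack([ones, exposure, mediator]),
                           outcome, rcond=None)[0][1]            # ~0.0
print(total, adjusted)
```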

7 Indeed, none of the empirical clinical studies (and only a handful of the epidemiological analyses) examined when preparing this Commentary appeared to recognise the important distinction between ‘prediction’ and causal inference (and the different methodological considerations that each require).

8 The authors of both these studies were well aware of these biases in advance of publication, and it is not clear why these biases were neither acknowledged nor competently addressed in the final versions they subsequently published.

9 This view, that interpolative estimation/classification and causal inference are critical to accurate prediction (beyond mere extrapolation), is not without its detractors (Broadbent 2015), but will be persuasive to those adopting a more pluralistic positivist approach that charts a path between excessive scepticism and over-reliance on definitive evidence (Fuller 2020a).

