Perspectives

Clinical epidemiology in the era of big data: new opportunities, familiar challenges

Pages 245-250 | Published online: 27 Apr 2017

Abstract

Routinely recorded health data have evolved from mere by-products of health care delivery or billing into a powerful research tool for studying and improving patient care through clinical epidemiologic research. Big data in the context of epidemiologic research means large interlinkable data sets within a single country or networks of multinational databases. Several Nordic, European, and other multinational collaborations are now well established. Advantages of big data for clinical epidemiology include improved precision of estimates, which is especially important for reassuring (“null”) findings; ability to conduct meaningful analyses in subgroups of patients; and rapid detection of safety signals. Big data will also provide new possibilities for research by enabling access to linked information from biobanks, electronic medical records, patient-reported outcome measures, automatic and semiautomatic electronic monitoring devices, and social media. The sheer amount of data, however, does not eliminate and may even amplify systematic error. Therefore, methodologies addressing systematic error, clinical knowledge, and underlying hypotheses are more important than ever to ensure that the signal is discernible behind the noise.

Introduction

Big data has firmly established itself in health research,Citation1,Citation2 illustrated by publications in high-ranking general-interest biomedical journals, including The New England Journal of Medicine,Citation3 JAMA,Citation4 Journal of Internal Medicine,Citation5 Science,Citation6Citation9 and Nature.Citation10Citation13 A basic definition of big data includes “the 3 Vs”: variety (linkage of many data sets from heterogeneous independent sources in a single data set); volume (large number of observations and variables per observation from different sources); and/or velocity (real-time or frequent data updates, often fully or partially automated).Citation14 Other definitions encompass three additional Vs: value (clinically relevant information); variability (eg, seasonal or secular disease trends); and veracity (data quality).Citation2 Routinely recorded health data are large automated data sets stemming from day-to-day activities of health care, such as hospital admissions or claims.Citation15Citation18 These data have evolved from mere byproducts of health care delivery or billing into a powerful tool for improving patient care through preventive, etiologic, and prognostic epidemiologic research.Citation4 A recent article summarizes the 46 most influential studies conducted with big data in health care,Citation1 while a review from 2015 provides multiple examples of the “variety” V in big data for health.Citation2

The notion of applying lessons from the clinical past to the clinical future is “as old as medicine.”Citation19 In a simplified form, evidence-based medical care means that a clinician can use research results in making treatment decisions in his or her clinical practice, often through explicit literature-based treatment guidelines. For a clinician, this means answers to questions such as: “How likely is my patient with atrial fibrillation on oral anticoagulants to develop major bleeding? Does the risk vary by type of anticoagulant or patient characteristics?” or “To what extent does comorbidity affect mortality of patients with hip fracture?” To be answered, a clinical question must first be translated into a precise research question and then back-translated and interpreted for clinical decision making. Therefore, it is essential for clinicians and epidemiologists to understand each other’s language. For an epidemiologist, an answer to a research question should be a precise and valid estimate of an underlying population parameter such as a mean, risk, incidence rate, or odds ratio. Big data – via the “volume” V – often addresses the precision component, but does little to address validity (the “veracity” V in the big-data vocabulary). Plausible hypotheses, expert knowledge, and accurate measurement tools must be available to ensure validity of research findings, since a highly precise biased result, especially one perceived as credible based on precision alone, is more dangerous when translated into clinical practice than an imprecise biased result.Citation20,Citation21 This paper, using primarily case studies from the Nordic countries, provides a brief overview and examples of use of big data in clinical epidemiology and outlines associated advantages and challenges.
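The distinction between precision and validity can be illustrated with a small calculation. The sketch below is not taken from any study discussed here: it assumes a hypothetical true risk of 10% and a hypothetical outcome-recording sensitivity of 80%, so that the risk seen in the data is systematically too low. It then shows how the expected 95% confidence interval (a simple Wald interval, chosen for brevity) narrows around the biased value as the sample grows.

```python
import math

# Hypothetical values chosen only for illustration (not from the article).
TRUE_RISK = 0.10      # assumed true risk of the outcome
SENSITIVITY = 0.80    # assumed share of true outcomes that get recorded


def wald_ci(p_hat, n, z=1.96):
    """Wald 95% confidence interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def expected_interval(n):
    """Interval centred on the risk we expect to observe, given under-recording."""
    observed_risk = TRUE_RISK * SENSITIVITY  # systematic error, unaffected by n
    return wald_ci(observed_risk, n)


def covers_truth(n):
    """True if the expected interval still contains the assumed true risk."""
    low, high = expected_interval(n)
    return low <= TRUE_RISK <= high


def main():
    for n in (200, 20_000, 2_000_000):
        low, high = expected_interval(n)
        print(f"n={n:>9,}  95% CI {low:.4f} to {high:.4f}  "
              f"covers true risk: {covers_truth(n)}")


if __name__ == "__main__":
    main()
```

Under these assumptions, the small sample yields an interval wide enough to contain the true risk, whereas the large samples yield narrow intervals that exclude it: more data sharpens the estimate of the wrong quantity.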

Examples of big data collaborations in epidemiology

Some say that the digitalization of medical records revolutionized the usability of big data in medical research.Citation4 Whether or not this claim is accepted, it is important to be aware that the current development follows a long evolution of using register data for medical research. This evolution started with the establishment of the first National Leprosy Register, in Norway, in 1856 (Figure 1),Citation22,Citation23 and of the Danish Cancer Registry, in 1943.Citation24 Other Nordic registries followed, most of them established between the 1960s and the early 2000s.Citation25,Citation26 Researchers in the Nordic countries were using the volume component of big data before the term was invented: for decades, epidemiologists have been conducting studies based on linkage of routinely collected data from multiple administrative, health, and demographic registries, and the potential of these registries has been recognized at least since the 1990s,Citation27 if not earlier.Citation28

Figure 1 Building that used to house the Norwegian Leprosy Registry, currently home of the Department of Global Public Health and Primary Care, University of Bergen, Norway.

Note: Courtesy: Dr Astrid Lunde.

Estimates of association with narrow confidence intervals often stem from big data analyses of common health outcomes in population-based registry data spanning several decades. When the intervention or the outcome of interest is rare, even data from an entire country may be insufficient, requiring that data from different countries be combined. Several formal or ad hoc collaborative networks in observational epidemiology have arisen, often from the need to study benefits and risks of relatively uncommon pharmacologicalCitation16,Citation29Citation31 or surgicalCitation32,Citation33 interventions, or vaccines.Citation3,Citation30 Examples of pan-Nordic collaborations using combined data from Denmark, Finland, Iceland, Norway, and SwedenCitation31,Citation34,Citation35 include studies on prenatal exposure to antidepressants and adverse effects in the offspringCitation31,Citation34,Citation35 and the Nordic Arthroplasty Register Association (NARA) database of about 1 million primary hip and knee replacement procedures performed since 1995 in Denmark, Finland, Norway, and Sweden.Citation36 NARA enabled studies of rare risk factors and outcomes, for which single-country data are too sparse.Citation32,Citation33 One clinically relevant question is whether the type of fixation used in total hip replacement (THR) is associated with risk of subsequent revision in patients younger than 55 years of age, since these patients may differ from older patients in mobility, post-THR life expectancy, and compliance with treatment. Only 5% of THR procedures are performed in patients younger than 55 years, and previous studies, including those based on national hip registries, had insufficient sample size to address the fixation issue in younger patients. Pedersen et alCitation37 used NARA to assemble a study population of ~30,000 patients younger than 55 years undergoing THR, with each fixation technique represented by more than 3,000 observations. The study yielded a clinically relevant message that uncemented implants are associated with a lower long-term risk of aseptic loosening but a higher short-term risk of revisions. Thus, the purpose of uncemented implants has been achieved in the long term, but technical issues causing dislocation, periprosthetic fracture, and infection have been previously overlooked in patients younger than 55 years.

Use of routinely collected data for epidemiologic research has also been possible outside the Nordic countries, including general practice-based data in the UK, or claims-based databases and database networks in the USA. In contrast to the typical European health care databases, which are established to fulfill administrative (health services), clinical quality, or surveillance needs, the US claims databases (eg, Medicare, Medicaid, and commercial insurance records) are by-products of medical accounting. Several European database networks, including those encompassing the Nordic data, have been successfully established and have found ways to overcome challenges of differences in the underlying health care systems, languages, data-sharing laws, record-generating mechanisms, and classifications.Citation5,Citation16,Citation30,Citation38,Citation39 Medical data in the Nordic countries are coded using a common basic set of standard classifications (International Classification of Diseases, Nordic Medico-Statistical Committee classification for procedures and causes of injury,Citation40,Citation41 or Anatomical Therapeutic Chemical codes for medications), which makes it easier to establish common algorithms. In the USA, Medicare and Medicaid provide financial incentives for “meaningful use” of electronic health records.Citation3 The most prominent big data collaborative models in the USA have been the Mini-Sentinel project and the Observational Medical Outcomes Partnership (OMOP).Citation3 The differences between routine records accumulated in systems like Mini-Sentinel or OMOP and those in Europe lie in the structure of the health care system, linkage possibilities, and the availability of lifelong complete follow-up. Thus, certain aspects of big data in Nordic countries are more diverse than those in many other databases (the “volume” V and the “variety” V of the big data), thanks to individual-level linkage to both medical and nonmedical data, including education, income, and residence, and because of lifelong follow-up. In 2013, the Mini-Sentinel project covered 360 million person-years of observation representing 150 million lives.Citation3 In 2014, the Danish Civil Registration System, with its linkable network of national registries, covered 400 million person-years of observation from 9.5 million lives.Citation25 Asian countries are building a linkable registry infrastructure with individual-level linkage mimicking that of the Nordic countries.Citation42
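The contrast in follow-up can be made concrete by dividing the person-years by the number of lives in each system. The short sketch below uses only the figures quoted above; the resulting averages are crude (they ignore when individuals enter and leave observation) and are meant only to show the order of magnitude. The function and dictionary names are illustrative.

```python
def mean_follow_up_years(person_years, lives):
    """Crude average observation time per person, in years."""
    return person_years / lives


# Figures as quoted in the text.
SYSTEMS = {
    "Mini-Sentinel (2013)": (360e6, 150e6),
    "Danish Civil Registration System (2014)": (400e6, 9.5e6),
}


def main():
    for name, (person_years, lives) in SYSTEMS.items():
        years = mean_follow_up_years(person_years, lives)
        print(f"{name}: about {years:.1f} years of observation per person")


if __name__ == "__main__":
    main()
```

By this rough measure, the Danish system holds on the order of four decades of observation per person, against a few years per person in Mini-Sentinel, despite the similar totals of person-years.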

The “variety” V of the big data is developing rapidly, whereby previously unused or underused types of data are incorporated into medical research, including electronic medical records, imaging, biobanks, and patient-reported data (including social media and wearables).Citation2,Citation43 Individual linkage may not always be necessary: in a classical ecologic study, hostility of language on Twitter was associated with country-specific mortality from heart diseases.Citation44 Pharmacovigilance with social media is already a reality.Citation45 Mobile phones can be used to test and subsequently deliver behavioral interventions such as smoking cessation aidCitation46 or adherence support.Citation47 The type of bias associated with certain types of data may change over time. For example, in the early days of epidemiologic research, random landline phone surveys tended to select the relatively more affluent, the employed, and the young. Today, these groups are more likely to be accessed via social networks and mobile telephony,Citation2 while use of landline phones may select for older or disadvantaged population segments.

Assembling database networks carries with it technical, logistical, ethical, and legal challenges.Citation48 The last two are often the hardest to overcome because of issues of data access, patient privacy, and potential conflicts of interest. Even in large studies, one has to remain vigilant about patient privacy and the possibility of inadvertently identifying individuals based on a set of rare characteristics. Gini et alCitation16 provide a practical guide to the different models of data networking, defined by the degree of centralization and harmonization of the different analytic processes. It seems practical to designate a single network partner, with adequate resources, to be the coordinating analytic hub. The process starts with raw data from each participating database and ends with the statistical output combining results of individual patients from all databases. Between the starting and the end points, there exist different models for the extent of process automation, autonomy, and control enjoyed by each data partner. A global protocol, with flexibility for local adaptations, is usually followed. Depending on the aims of the study, the analysis may entail as little sharing as contributing country-specific odds ratios for a meta-analysis or as much sharing as harmonization and pooling of individual-level data sets.Citation16 Harmonization involves transformations, whereby each partner creates standard input data sets according to exact specification – a common data model (CDM) – which dictates the data set types and structure, variable names and attributes, and definitions of derived variables. A single statistical analytic program is then run on the CDM-conforming files either by each network partner locally (“one analyst, many outputs”) or centrally by the hub on the combined data set (“one analyst, one output”). By contrast, the “many analysts, many outputs” approach is discouraged because it is prone to error and duplicates work. Whether there is one analyst or many, quality control of programming by another analyst is always necessary.
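The "as little sharing as possible" end of this spectrum can be sketched as follows. The example assumes that each data partner shares only an aggregated 2×2 table and that the hub combines the country-specific odds ratios with a fixed-effect, inverse-variance meta-analysis. This is one common choice, not necessarily the method used by the networks cited above; the class name, field names, and all counts are hypothetical.

```python
import math
from typing import Iterable, NamedTuple, Tuple


class TwoByTwo(NamedTuple):
    """Aggregated counts one data partner might share (hypothetical layout)."""
    exposed_cases: int
    exposed_noncases: int
    unexposed_cases: int
    unexposed_noncases: int


def log_odds_ratio(table: TwoByTwo) -> Tuple[float, float]:
    """Log odds ratio and its Woolf variance; assumes no cell is zero."""
    a, b, c, d = table
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance


def pool_fixed_effect(tables: Iterable[TwoByTwo],
                      z: float = 1.96) -> Tuple[float, float, float]:
    """Inverse-variance pooled odds ratio with a 95% confidence interval."""
    weighted_sum = 0.0
    total_weight = 0.0
    for table in tables:
        log_or, variance = log_odds_ratio(table)
        weight = 1 / variance          # more precise sites count for more
        weighted_sum += weight * log_or
        total_weight += weight
    pooled = weighted_sum / total_weight
    se = math.sqrt(1 / total_weight)
    return (math.exp(pooled),
            math.exp(pooled - z * se),
            math.exp(pooled + z * se))


def main():
    # Hypothetical counts for three unnamed partners.
    site_tables = [
        TwoByTwo(30, 970, 250, 9750),
        TwoByTwo(45, 1455, 400, 14600),
        TwoByTwo(12, 488, 110, 4890),
    ]
    pooled_or, low, high = pool_fixed_effect(site_tables)
    print(f"Pooled OR {pooled_or:.2f} (95% CI {low:.2f} to {high:.2f})")


if __name__ == "__main__":
    main()
```

A fixed-effect model assumes that all sites estimate the same underlying odds ratio; when health care systems or record-generating mechanisms differ substantially, a random-effects model or pooling of harmonized individual-level data may be more appropriate.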

Health outcomes measured by health care professionals might differ from the outcomes subjectively experienced by patients, and the latter also affect the outcome of treatment. To fill this gap, patient-reported outcome measures (PROMs) are being used increasingly.Citation49 An example of incorporation of PROMs in a single-country setting, while capitalizing on unique data linkage capabilities common to the Nordic settings, is the generic infrastructure for collecting PROM data, AmbuFlex, developed in Denmark by Hjollund et al.Citation50 The researchers have successfully implemented flexible paper-based and electronic data collection on PROMs in more than 20 projects since 2004. Group-level aggregated PROM data, linked with data from routine registries and clinical databases, can be used to monitor national and regional hospital performance in oncology and cardiology care, psychiatry, neurology, and orthopedics. Patient-level PROM data collected at the clinic level, in combination with electronic health records, can be used to facilitate screening, clinical decisions, patient–doctor communication, and efficient use of resources in cardiology, rheumatology, and oncology. Response rates exceeded 75% in all and 90% in most cases. A clinical decision support function of PROMs can save clinicians’ time by using an algorithm-based initial identification of patients in need of immediate attention, while presenting data on other patients in a decision-supporting format for clinical judgment.Citation50 AmbuFlex is a unique example of implementation in routine care, a generic system integrated with electronic medical records, and is used for longitudinal collection of detailed PROM data on an individual level to personalize the care for the individual patient. This allows the collection of PROM data on large cohorts of chronically ill patients over many years, similar to the systems currently in place for administrative data.

Big data in epidemiology: benefits and challenges

Precision of results is not the only benefit of big data. Observations from large numbers of individuals allow rapid detection of potential risk signals associated with newly marketed therapies, for which risks of rare adverse events are rarely known from Phase III preapproval trials (the velocity “V” of the big data).Citation51 A thought experiment showed that having records of 100 million patients for safety monitoring would have allowed the detection of adverse cardiovascular effects of rofecoxib (Merck, Kenilworth, NJ, USA) in 3 months instead of 5 years.Citation5,Citation52 On the other hand, large data sets help convincingly rule out harmful associations, in the so-called “null studies.” One example is the abovementioned Nordic collaboration on safety of antidepressant use in pregnancy. Fewer than 2% of pregnant women use selective serotonin reuptake inhibitors (SSRIs) in pregnancy, while birth defects affect about 3% of live births. Therefore, it took a pan-Nordic study to assemble a study population of >1.5 million pregnancies with ~73,000 malformation cases, including ~33,000 SSRI-exposed pregnancies with >1,300 cases exposed to SSRIs.Citation34 The study convincingly showed a null association between maternal use of SSRIs and major birth defects, providing reassurance to pregnant women with depression and their physicians. Finally, in analyses based on large data sets, estimates are likely to be “highly statistically significant,” ie, associated with P-values <0.05. This “universal statistical significance” could finally lay to rest reliance on P-values for interpretation of study results, allowing researchers to focus on clinical significance instead.Citation53Citation55
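The idea of "universal statistical significance" can be illustrated with a two-proportion z-test applied to the same small risk difference at increasing sample sizes. The risks below (3.0% versus 3.1%) are hypothetical and chosen only to represent a difference that may be clinically negligible; the sketch works with expected proportions, not sampled data, and assumes two groups of equal size.

```python
import math


def two_proportion_p_value(risk_a, risk_b, n_per_group):
    """Two-sided P-value of a z-test for two proportions, equal group sizes."""
    pooled = (risk_a + risk_b) / 2                   # pooled risk, equal n
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = (risk_a - risk_b) / se
    return math.erfc(abs(z) / math.sqrt(2))          # P(|Z| > |z|), standard normal


def main():
    risk_unexposed, risk_exposed = 0.030, 0.031      # hypothetical risks
    for n in (1_000, 100_000, 1_000_000):
        p = two_proportion_p_value(risk_exposed, risk_unexposed, n)
        print(f"n per group = {n:>9,}: risk difference 0.1 percentage points, "
              f"P = {p:.3g}")


if __name__ == "__main__":
    main()
```

The risk difference is identical in every row; only the sample size changes. With enough observations the P-value drops below 0.05, which says nothing about whether a difference of 0.1 percentage points matters clinically.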

The perks of big data should not go to our collective heads. Big data does not address the usual epidemiologic challenges related to validity, and may even amplify them.Citation15,Citation56 Accurate measurement of study variables remains imperative in big-data settings. An advantage of multinational databases is that estimates originating from different databases to address the same research question amount to reproducibility checks of results under varying assumptions about the record-generating mechanisms and the effects of the underlying health care and social structures. At the same time, in multinational database studies, validity concerns increase in proportion to the number of databases, with the need for several valid operational definitions for the same clinical characteristic or event, to avoid propagating a systematic error on a large scale.Citation53,Citation56 Validation of algorithms in large secondary databases remains imperative for valid inference.Citation15,Citation56,Citation57 The NARA collaboration has contributed to improvement of data validity in all four participating countries through regular meetings, where differences in registration practice have been discussed. Also, through different research projects, a number of differences in data quality between registries have been pointed out and discussed, and changes in national registries have subsequently been made to achieve uniform data definition, collection, and interpretation.

Large amounts of missing data may cause selection bias and undermine gains in precision afforded by big data, since in multiple regression models, standard statistical software removes observations with missing values. Reverse causation, immortal time bias,Citation58 and healthy user/healthy adherer biasCitation59 are likewise not remedied by large amounts of data and need to be addressed in big-data and small-data studies alike. On a pragmatic level, delay of data delivery and changes in coding practice present additional challenges.
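The loss of observations through complete-case (listwise) deletion can be sketched as follows. The example assumes, purely for illustration, that each covariate is missing independently with the same probability; real registry data rarely behave this way, and the variable names and values are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

Record = Dict[str, Optional[float]]


def complete_cases(records: List[Record], variables: Sequence[str]) -> List[Record]:
    """Keep only records with a value for every model variable (listwise deletion)."""
    return [r for r in records if all(r.get(v) is not None for v in variables)]


def expected_complete_fraction(missing_per_variable: float, n_variables: int) -> float:
    """Expected share of complete records if missingness is independent across variables."""
    return (1 - missing_per_variable) ** n_variables


def main():
    # Tiny hypothetical data set: None marks a missing value.
    records = [
        {"age": 54.0, "bmi": 27.1, "smoker": 1.0},
        {"age": 61.0, "bmi": None, "smoker": 0.0},
        {"age": 47.0, "bmi": 31.4, "smoker": None},
        {"age": 70.0, "bmi": 24.9, "smoker": 0.0},
    ]
    kept = complete_cases(records, ["age", "bmi", "smoker"])
    print(f"{len(kept)} of {len(records)} records enter the regression")

    for k in (1, 5, 10):
        share = expected_complete_fraction(0.10, k)
        print(f"{k:>2} covariates, each 10% missing: "
              f"{share:.0%} of records remain")


if __name__ == "__main__":
    main()
```

Under the independence assumption, modest missingness in each covariate compounds quickly as covariates are added. If missingness is related to exposure or outcome, the remaining records may also be a selected subset, which is the selection bias described above.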

Conclusion

Epidemiologic research, including database research, is an “exercise in measurement,”Citation60 in an effort to maximize signal-to-noise ratio. The results of big data-based medical research represent a dividend to the public on its investment in the form of contribution to routine databases with data and with tax money. The advantages of big data are precision of results, including precise “null” findings, ability to address clinical questions in patient subgroups, and rapid detection of risk signals. In the Nordic countries, big data is collected and maintained by public institutions and operates in the setting of income-independent access to health care and lifelong follow-up. In other settings, such as US claims databases, demographic or economic disadvantages are better represented, while follow-up is not lifelong and health care access may be interrupted. Combining evidence from different settings and countries creates multiple-informant settings, providing built-in cross-validation and addressing a wide array of clinical questions in a single study. A formal requirement of big data is that the size, complexity, and velocity of the data are too intense for processing and interpretation with existing tools. In the Nordic settings, the volume has been available for some decades, and the variety is increasing rapidly to include data on imaging, behavior, geo-location, ecology, genetics, and patient-reported outcomes. Velocity has not yet reached the real-time update stage, but it is improving, and its value is obvious. Veracity (familiar to epidemiologists as validity) needs to be assured before data can be interpreted. The large amount of data, thus, does not eliminate and may amplify sources of systematic error. To that end, technical expertise, clinical knowledge, and underlying hypotheses are more important than ever to ensure that the signal is not drowned out by noise.

Acknowledgments

We thank Professor Olaf M Dekkers for helpful comments on the early drafts of this manuscript and Dr Astrid Lunde for providing the photo for Figure 1. This paper was funded by the Program for Clinical Research Infrastructure established by the Lundbeck Foundation and the Novo Nordisk Foundation and administered by the Danish Regions.

Disclosure

The authors report no conflicts of interest in this work.

References