Editorial

The poisoning of big data: using large data registries for research in toxicology


KEYWORDS NPDS; retrospective chart review; methodology

Large data registries are frequently used for toxicosurveillance and research in medical toxicology given the ease of data acquisition, geographic diversity, and large volume of cases. The National Poison Data System (NPDS), notably, is the largest and most frequently published data registry of poison-related exposures in the United States [Citation1]. These data, and the resultant studies, are important for detecting and responding to novel trends in chemical and poison exposures. However, datasets with large volumes and high frequencies of observations, often referred to as “Big Data,” are prone to errors, omissions, and bias, particularly when the final studied populations are small. How should we in the toxicology community interpret these studies? Here, we discuss the strengths, limitations, and complexities of using large data registries for research in medical toxicology, with a particular focus on NPDS given its prominence in the toxicology literature. In addition, we offer methodological suggestions for investigators to reduce bias and improve data transparency when using large data registries for clinical research.

National poison data collection began in 1983 and, since 2003, has been digitized into a web-based interface [Citation2,Citation3]. NPDS currently collects case data from all 55 poison control centers across the country and US territories. Case information is uploaded in near real-time, at roughly 8-minute intervals, and the database contains over 74 million case records [Citation1,Citation3]. The real-time uploading of case information has made the NPDS a useful tool for public health surveillance, as it can identify novel geographical and temporal trends in exposures [Citation4]. For example, the Centers for Disease Control and Prevention (CDC) uses NPDS to identify potential public health threats and early markers of incidents of public health significance, as well as to enhance situational awareness and inform public health responses during a suspected or known public health threat [Citation5]. To its credit, the volume and diversity of cases are impressive, and it takes significant effort to collect and organize the data. But what about its inherent quality for the purpose of clinical research?

Specialists in Poison Information (SPIs) and Poison Information Providers (PIPs) answer all poison center calls, entering case information into one of several electronic medical record systems approved by the American Association of Poison Control Centers (AAPCC) [Citation3]. Certain demographics are collected and reported uniformly (e.g. age, gender, geographic region), but additional clinical data often end up in a non-standardized free-text area and represent the very data that clinicians are most interested in studying. The AAPCC publishes guidelines outlining appropriate coding practices and provides online training for SPIs, but this training does not always cover the non-standardized fields, and it is up to individual poison centers to ensure oversight and quality control of their data.

To obtain follow-up information, many poison centers utilize students, residents, and fellows who have no specific training in data collection. Thus, the collected information, which is subsequently coded by SPIs, is prone to errors of omission, commission, and inappropriate inclusion, including the option to code an exposure as “other” rather than with a more specific code. Limited resources preclude most individual poison centers from reviewing every entry in their electronic medical record database, which results in a high degree of variability in the amount, type, and quality of data that eventually make their way into the NPDS coding fields.

As with all large data registries, there are inherent strengths and limitations of each system that should be considered when interpreting published data registry studies. In comparison to other databases, such as the Toxicology Investigators Consortium (ToxIC) registry or the National Poisons Information Service (NPIS) in the United Kingdom, the NPDS has a broader and more geographically diverse catchment area. In addition, NPDS provides uniformity and internal consistency in its reported data through the use of a single data collection form in the final database [Citation6]. However, the NPDS database makes no distinction between clinical information obtained from a patient’s family and information provided by a trained toxicologist. Admittedly, this is also true for the other toxicology databases, which likewise rely on second- or even third-hand information. Cases in the ToxIC registry are more frequently associated with bedside consultation, which can enhance the scope and accuracy of the collected data. However, this also limits the pool of potential participating centers for ToxIC largely to larger academic institutions, resulting in the omission of many exposures and the underrepresentation of incidence data.

Large data registries like the NPDS, ToxIC, and NPIS also share several important systematic limitations. Cases are reported to poison control centers or bedside consultation services on a voluntary basis, leading to significant variability in the number and types of cases based on local practice patterns, which results in reporting bias [Citation4]. Ironically, exposures managed by independent or non-participating toxicology consulting services may never make it into any of the registries. In addition, while large data registries acknowledge reporting only numerator data (i.e. counts of events), studies that report numerator-derived measures over several years, such as incidence rates, may fail to account for coinciding changes in the denominator over that time. Consequently, studies that claim statistically significant temporal associations may simply be measuring changes in consultation utilization or underlying population shifts. It is important for the toxicology community to consider these limitations when evaluating the rigor of registry studies.

However, some surveillance systems use other denominators that can provide greater data granularity. In contrast to the NPDS and NPIS, which use population as the denominator, the Researched Abuse, Diversion and Addiction-Related Surveillance (RADARS) System frequently uses the ‘unique recipients of dispensed drug’ (URDD) denominator [Citation7]. The URDD is defined as the number of unique individuals who obtain a drug from a pharmacy, which provides context for the frequency of healthcare events relative to drug availability [Citation8]. Other denominators have also been used, such as the number of written prescriptions, the weight of drug sold, and the number of dosage units sold. Each denominator provides useful information depending on the context, and the resulting signal is often more meaningful than one normalized to population alone. While this addresses a significant limitation of the other registries, RADARS focuses primarily on drugs that are misused, abused, or diverted, which limits its scope and applicability to other exposures.
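To make the denominator point concrete, the short sketch below uses entirely hypothetical counts (not drawn from NPDS, NPIS, or RADARS) to show how the same numerator series can appear to rise when normalized to population yet fall when normalized to URDD.

```python
# Illustrative only: hypothetical numbers, not registry data.
# The same annual case counts yield opposite trends depending on the denominator.

cases = {2018: 200, 2019: 260, 2020: 330}                         # numerator: exposure calls
population = {2018: 5_000_000, 2019: 5_050_000, 2020: 5_100_000}  # catchment population
urdd = {2018: 40_000, 2019: 60_000, 2020: 90_000}                 # unique recipients of dispensed drug

for year in sorted(cases):
    per_100k_pop = cases[year] / population[year] * 100_000       # population-based rate
    per_1k_urdd = cases[year] / urdd[year] * 1_000                # availability-adjusted rate
    print(f"{year}: {cases[year]} cases | "
          f"{per_100k_pop:.1f}/100,000 population | "
          f"{per_1k_urdd:.1f}/1,000 URDD")

# In this hypothetical, raw counts and population-based rates rise each year,
# while the URDD-adjusted rate falls, because drug availability grew faster
# than the case count.
```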

Still, large data registries can be a useful resource when paired with appropriate methodology, data-driven conclusions, and transparent limitations. NPDS, for example, is well suited for large cohort studies in which investigators seek to describe general trends in exposures, or for descriptive studies utilizing basic demographic data that are generally less prone to error. While the conclusions from such studies remain limited to epidemiologic trends, such as the incidence of calls based on product codes or the incidence of clinical complications, they are useful for hypothesis generation and as a starting point for further, more detailed, rigorous studies. Investigators should be cautious in drawing conclusions from coded symptoms and treatments, especially when suggesting causality between an exposure and a clinical effect or outcome, given the multitude of potential errors in these data.

An excellent example of high-quality data in a large-cohort NPDS study is one by Chatham-Stephens et al., which examined trends in e-cigarette exposures [Citation9]. In this study, the authors queried the NPDS database for all e-cigarette and e-liquid exposures over a four-year period and extracted basic demographic information from the initial data set. The authors then examined all cases coded as having a “moderate” or “major” outcome and requested the free-text notes from individual poison centers for each of those cases. Additional uncoded data were then abstracted from these free-text notes, providing a more complete picture of factors that may have contributed to more serious outcomes. The additional step of reviewing the free-text notes, although labor intensive, added important information to the initial data set and enhanced the internal validity of the study.

When using large data registries for smaller cohort studies, we have previously used, and advocate for, a two-pronged data acquisition approach. First, investigators should identify the initial cohort using the database. Then, investigators should request and review individual hospital or poison center records for each case to assess for coding and substance errors and to account for any symptoms, treatments, or outcomes omitted from the coded data. This allows for a more accurate and rigorous description of the cohort. Admittedly, this methodology requires significantly more work and still has significant limitations, but it is a step in the right direction toward improving the overall accuracy of the reported data. No registry is perfect, and limitations need to be honestly acknowledged, but reasonable efforts should be made to be more rigorous in our approach.
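As an illustration of the reconciliation step in this two-pronged approach, the sketch below compares a coded registry extract against chart-review abstractions and flags discrepancies for adjudication. The field names and severity codes are assumptions made for the example, not the actual NPDS export schema.

```python
# Hedged sketch: the field names (case_id, coded_substance, coded_outcome)
# are illustrative assumptions, not real NPDS export fields.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CodedCase:                 # step 1: row returned by the registry query
    case_id: str
    coded_substance: str
    coded_outcome: str           # e.g. "minor", "moderate", "major"

@dataclass
class ChartReview:               # step 2: abstraction from the source record
    case_id: str
    confirmed_substance: Optional[str]
    confirmed_outcome: Optional[str]

def reconcile(coded: list[CodedCase], reviews: dict[str, ChartReview]) -> list[dict]:
    """Flag coding and substance discrepancies between registry data and chart review."""
    results = []
    for case in coded:
        review = reviews.get(case.case_id)
        results.append({
            "case_id": case.case_id,
            "chart_unavailable": review is None,
            "substance_mismatch": bool(review and review.confirmed_substance
                                       and review.confirmed_substance != case.coded_substance),
            "outcome_mismatch": bool(review and review.confirmed_outcome
                                     and review.confirmed_outcome != case.coded_outcome),
        })
    return results
```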

Since their inception, large toxicology registries have been an important data source for tracking exposures, identifying novel trends, and generating interesting hypotheses. As such, many important studies have been published using registry data. However, the quality of these studies, and of the conclusions drawn from them, is only as good as the source from which they come. We hope these suggestions will be of use for investigators when designing their studies and for the readership when interpreting them.

Conflicts of interest

The authors [AM, JAL, CGS] have no conflicts of interest to disclose.

Funding

None

References