Abstract
Background: The objective of this study was to develop and validate machine learning models for data entry error detection in a national out-of-hospital cardiac arrest (OHCA) prehospital patient care report database.
Methods: Adult OHCAs of presumed cardiac etiology were included. Data entry errors were defined as discrepancies between the coded data and the free-text note documenting the intervention or event; for example, information that was recorded as “absent” in the coded data but “present” in the free-text note. Machine learning models using the extreme gradient boosting, logistic regression, extreme gradient boosting outlier detection, and K-nearest neighbor outlier detection algorithms for error detection within nine core variables were developed and then validated for each variable.
Results: Among 12,100 OHCAs, the proportion of cases with at least one error type was 16.2%. The area under the receiver operating characteristic curve (AUC) of the best-performing model for each outcome variable (the model with the highest AUC) ranged from 0.71 to 0.95. Machine learning models detected errors most efficiently for outcome place and initial rhythm; 82.6% of place errors and 93.8% of initial rhythm errors could be detected by checking 11% and 35% of the data, respectively, rather than checking all data.
Conclusion: Machine learning models can detect data entry errors in care reports of emergency medical services (EMS) clinicians with acceptable performance and likely can improve the efficiency of the process of data quality control. EMS organizations that provide more prehospital interventions for OHCA patients could have higher error rates and may benefit from the adoption of error-detection models.
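The efficiency figures in the Results (share of errors detected while checking only part of the data) can be illustrated with a minimal sketch. It assumes each record has a model-assigned error score and that reviewers check the highest-scoring records first. The function name `capture_rate_at_fraction` and the toy scores and labels are hypothetical and are not taken from the study; this is not the authors' implementation.

```python
import math


def capture_rate_at_fraction(scores, is_error, fraction):
    """Share of all true errors found when only the top `fraction`
    of records, ranked by descending error score, is reviewed."""
    if len(scores) != len(is_error):
        raise ValueError("scores and is_error must have the same length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    total_errors = sum(is_error)
    if total_errors == 0:
        raise ValueError("no true errors to detect")
    # Number of records to review; rounding first guards against
    # floating-point noise such as 0.3 * 10 == 3.0000000000000004.
    n_review = math.ceil(round(fraction * len(scores), 9))
    # Rank records from most to least suspicious.
    ranked = sorted(zip(scores, is_error), key=lambda pair: pair[0], reverse=True)
    caught = sum(flag for _, flag in ranked[:n_review])
    return caught / total_errors


# Hypothetical toy data: 10 records, 3 of which contain a true error (1).
toy_scores = [0.9, 0.8, 0.1, 0.7, 0.2, 0.05, 0.3, 0.6, 0.15, 0.4]
toy_errors = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]

rate_20 = capture_rate_at_fraction(toy_scores, toy_errors, 0.2)
rate_50 = capture_rate_at_fraction(toy_scores, toy_errors, 0.5)
print(f"Reviewing 20% of records catches {rate_20:.1%} of errors")
print(f"Reviewing 50% of records catches {rate_50:.1%} of errors")
```

Under this reading, a statement such as "82.6% of place errors detected while checking 11% of data" corresponds to a capture rate of 0.826 at a review fraction of 0.11 for the place variable.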
Author Contributions
D.H. Choi and J.H. Park had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. S. Kim, J.H. Park, K.J. Song, and S.D. Shin were responsible for the study concept and design. J.H. Park, K.J. Song, and S.D. Shin were responsible for acquisition, analysis, and interpretation of the data. D.H. Choi, Y.H. Choi, and J.H. Park were responsible for the drafting of the manuscript. S.D. Shin, K.J. Song, and S. Kim were responsible for the critical revision of the manuscript for important intellectual content. D.H. Choi, Y.H. Choi, and J.H. Park were responsible for statistical analysis. J.H. Park and S.D. Shin obtained funding. S. Kim, Y.H. Choi, and S.D. Shin provided administrative, technical, and material support. S. Kim and S.D. Shin provided study supervision. All authors approved the manuscript.
Disclosure Statement
No potential conflict of interest was reported by the authors.
Correction Statement
This article has been corrected with minor changes. These changes do not impact the academic content of the article.