Original Contributions

The Performance of ChatGPT-4 and Gemini Ultra 1.0 for Quality Assurance Review in Emergency Medical Services Chest Pain Calls

Received 20 Apr 2024, Accepted 26 Jun 2024, Published online: 22 Jul 2024
 

Abstract

Objectives

This study assesses the feasibility, inter-rater reliability, and accuracy of using two large language models (LLMs), OpenAI’s ChatGPT-4 and Google’s Gemini Ultra, for Emergency Medical Services (EMS) quality assurance. Implementing these LLMs for EMS quality assurance could substantially reduce the workload on medical directors and quality assurance staff by automating aspects of the processing and review of patient care reports. This could allow more efficient and accurate identification of areas requiring improvement, potentially enhancing patient care outcomes.

Methods

Two expert human reviewers, ChatGPT-4, and Gemini Ultra assessed and rated 150 consecutively sampled and anonymized prehospital records from two large urban EMS agencies for adherence to 2020 National Association of State EMS metrics for cardiac care. We evaluated the accuracy of scoring, inter-rater reliability, and review efficiency. The inter-rater reliability for the dichotomous outcome of each EMS metric was measured using the kappa statistic.
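As an illustration of the agreement measures named above, the following minimal Python sketch computes percent agreement and Cohen's kappa for two raters scoring a dichotomous metric. It is not the authors' analysis code: the function name `cohen_kappa` and the example ratings are hypothetical, and the sketch does not compute the interval reported alongside each kappa.

```python
from collections import Counter
import math


def cohen_kappa(rater_a, rater_b):
    """Return (percent_agreement, kappa) for two equal-length rating lists.

    Illustrative only; labels may be any hashable values, e.g. 1 = metric met,
    0 = metric not met.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("rating lists must be non-empty and of equal length")

    n = len(rater_a)

    # Observed agreement: share of charts where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal proportions, summed over all labels.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)

    # Kappa is undefined when chance agreement is already perfect
    # (both raters used a single identical label throughout).
    if expected == 1.0:
        return observed, math.nan

    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical ratings for ten charts (not data from the study).
human = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
model = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
agreement, kappa = cohen_kappa(human, model)
print(f"agreement = {agreement:.1%}, kappa = {kappa:.3f}")
```

Kappa discounts the agreement expected by chance, which is why two raters can agree on a large share of charts and still have a modest kappa when one outcome dominates.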

Results

Human reviewers showed high inter-rater reliability, with 91.2% agreement and a kappa coefficient of 0.782 (0.654-0.910). ChatGPT-4 achieved substantial agreement with human reviewers in EKG documentation and aspirin administration (76.2% agreement, kappa coefficient 0.401 [0.334-0.468]), but performance varied across other metrics. Gemini Ultra’s evaluation was discontinued due to poor performance. No significant differences were observed in median review times: 01:28 min (IQR 01:12-01:51 min) per human chart review, 01:24 min (IQR 01:09-01:53 min) per ChatGPT-4 chart review (p = 0.46), and 01:50 min (IQR 01:10-03:34 min) per Gemini Ultra review (p = 0.06).

Conclusions

Large language models demonstrate potential in supporting quality assurance by effectively and objectively extracting data elements. However, their accuracy in interpreting non-standardized and time-sensitive details remains inferior to that of human evaluators. Our findings suggest that current LLMs may best offer supplemental support to human review processes, but their current value remains limited. Improvements in LLM training and integration are recommended for better and more reliable performance in quality assurance processes.

Acknowledgments

We would like to acknowledge the hardworking women and men of Salt Lake City Fire Department and Unified Fire Authority for their professionalism and dedication to the communities they serve.

Declaration of Generative AI in Scientific Writing

The authors did not use a generative artificial intelligence (AI) tool or service to assist with preparation or editing of this work, other than as stated in the manuscript. The authors take full responsibility for the content of this publication.

Disclosure Statement

Dr. Youngquist reports consulting fees and research funds from Colabs Medical, Inc. The remaining authors report there are no competing interests to declare.
