Nose/Sinus

Assessing unknown potential—quality and limitations of different large language models in the field of otorhinolaryngology

Pages 237-242 | Received 16 Apr 2024, Accepted 03 May 2024, Published online: 23 May 2024
 

Abstract

Background

Large Language Models (LLMs) might offer a solution to the shortage of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear.

Aims/objectives

Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL).

Material and methods

Case-based questions were extracted from the literature and from German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert scale for medical adequacy, comprehensibility, coherence, and conciseness. The given answers were compared with the validated answers and evaluated for potential hazards. A modified Turing test was performed, and character counts were compared.

Results

The LLMs' answers were rated inferior to the consultants' answers in all categories. Yet, the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among the LLMs, Claude 2 was rated best for medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246), ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/246) for Bard 2023.07.13, and 6% (71/1230) for the consultants.

Conclusions and significance

Despite the consultants' superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on a larger scale.


Disclosure statement

SK is the founder and shareholder of MED.digital.