Assessing unknown potential—quality and limitations of different large language models in the field of otorhinolaryngology

Christoph R. Buhr (a) Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany; (b) School of Medicine, University of St Andrews, St Andrews, UK. Correspondence: [email protected]

https://orcid.org/0000-0002-9551-2310

Harry Smith (c) School of Computer Science, University of St Andrews, St Andrews, UK

Tilman Huppertz (a) Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany

Katharina Bahr-Hamm (a) Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany

Christoph Matthias (a) Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany

Clemens Cuny (d) Outpatient Clinic, Clemens Cuny, Dieburg, Germany

Jan Phillipp Snijders (d) Outpatient Clinic, Clemens Cuny, Dieburg, Germany

Benjamin Philipp Ernst (e) Department of Otorhinolaryngology, University Hospital Frankfurt, Frankfurt, Germany

Andrew Blaikie (b) School of Medicine, University of St Andrews, St Andrews, UK

Tom Kelsey (c) School of Computer Science, University of St Andrews, St Andrews, UK

Sebastian Kuhn (f) Institute for Digital Medicine, Philipps-University Marburg, University Hospital of Giessen and Marburg, Marburg, Germany

Jonas Eckrich (a) Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany

https://orcid.org/0000-0001-5498-4031


Figures & data

Figure 1. Comparison between ORL consultants and the different LLMs (ChatGPT 4, Bard 2023.07.13, Claude 2) for all evaluated categories. Data are shown as a scatter dot plot, with points representing absolute values and bar width representing the number of individual values. Horizontal lines represent the mean (95% CI). Normality of the distribution was tested with the D’Agostino and Pearson test. Multi-group comparisons were performed using the Kruskal-Wallis test. ns p > .05, *p < .05, **p < .01, ***p < .001, ****p < .0001.

Figure 2. The number of characters per answer used by ORL consultants and the different LLMs (ChatGPT 4, Bard 2023.07.13, Claude 2). Data are shown as a scatter dot plot, with each point representing an absolute value. Horizontal lines represent the median. Normality of the distribution was tested with the D’Agostino and Pearson test. Multi-group comparisons were performed using the Kruskal-Wallis test. ****p < .001.

Table 1. Correlation analysis of the ratings for Medical Adequacy, Comprehensibility, Coherence, and Conciseness and the number of characters used (ns p > .05, *p < .05, **p < .01, ***p < .001, ****p < .0001).
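The figure legends name the Kruskal-Wallis test for the multi-group comparisons and use an asterisk notation for significance. As a rough illustration only, and not the authors' analysis code, the sketch below computes the Kruskal-Wallis H statistic (with the usual tie correction) in plain Python and maps a p-value to the asterisk labels from the Figure 1 legend. The function names and the example ratings are hypothetical. The sketch stops at H; in practice the p-value is usually obtained from a chi-square approximation with k - 1 degrees of freedom, for instance via a statistics library such as SciPy (`scipy.stats.kruskal`, and `scipy.stats.normaltest` for the D’Agostino and Pearson test).

```python
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with tie correction.

    Raises ZeroDivisionError if every pooled value is identical.
    """
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    return h / (1 - ties / (n ** 3 - n))


def significance_label(p):
    """Map a p-value to the asterisk notation of the Figure 1 legend.

    Assumption: p exactly equal to .05 is treated as not significant.
    """
    if p < 0.0001:
        return "****"
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"


# Hypothetical ratings, invented for illustration (not data from the study).
consultants = [6, 5, 6, 4, 5]
model_a = [4, 3, 5, 4, 3]
model_b = [3, 4, 2, 3, 4]
print(kruskal_wallis_h(consultants, model_a, model_b))
print(significance_label(0.003))
```

A rank-based test fits data such as ratings because it uses only the ordering of the values and does not assume a normal distribution.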


Supplemental material
