ORIGINAL RESEARCH

Bridging AI and Clinical Practice: Integrating Automated Sleep Scoring Algorithm with Uncertainty-Guided Physician Review

Pages 555-572 | Received 19 Dec 2023, Accepted 18 Apr 2024, Published online: 28 May 2024

Figures & data

Table 1 Demographic Characteristics of BSDB Subjects with Respect to Individual Data Splits

Table 2 Occurrence of Different Classes of Sleep Disorders Among Conclusive Diagnoses of Subjects per Individual Data Splits of BSDB

Table 3 Measures Evaluating Prediction’s Uncertainty Using U-Sleep Softmax Output

Figure 1 Schematic overview of datasets used, their size, and purpose.

Notes: A set of 13 open-access datasets (in blue) was used for the baseline training of U-Sleep. The middle and right parts of the schema relate to the evaluations on BSDB. Its ID part comprises PSGs, each scored by one of more than 50 assistant physicians (APs) and 10 senior physicians (SPs). The ID training and validation splits (in yellow) were used to fine-tune U-Sleep and, subsequently, to train the confidence network. Baseline evaluation of both algorithmic approaches was performed on the ID-test data (in orange). Their robustness was further evaluated on two OOD test sets (in red), each containing PSGs scored by a unique SP.
Abbreviations: ID, in-domain; OOD, out-of-domain; SP, senior physician; AP, assistant physician; PSG, polysomnography.

Table 4 Classification Performance of U-Sleep on Individual Data Splits

Table 5 Performance of Uncertainty Measures to Identify U-Sleep Predictions Diverging from Human Scoring on Individual Data Splits

Figure 2 Combined output of the predicted hypnogram (in white) and the associated confidence TCP-scores (in the background), supplemented with the physician-scored hypnogram (in blue).

Notes: Combined output for a 44-year-old female diagnosed with hypersomnolence. On-subject accuracy (Acc), weighted F1-score (F1w), and Cohen’s kappa (K) were 79.2%, 72.2%, and 61.5%, respectively. The on-subject average TCP was 0.74; for correctly and incorrectly classified epochs, it was 0.87 and 0.41, respectively.
Abbreviation: TCP, true class probability.

Figure 3 Schematic overview of the implemented pipeline.

Notes: An EEG-EOG channel pair is used as the input for the U-Sleep classifier. From the trained U-Sleep, several representations are extracted (the softmax output; a binary code indexing the predicted class; and the hidden representations, or "hiddens", from the layer preceding softmax) and used as the input for the confidence network, which estimates the True Class Probability (TCP) confidence score. The hypnogram predicted by U-Sleep (y) is provided jointly with the assessment of predictive uncertainty (1-TCP) to guide an efficient review by a physician.
Abbreviations: N, number of epochs; TCP, true class probability; y, U-Sleep predicted sleep-stages.
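The data flow described in the notes to Figure 3 can be illustrated with a minimal sketch. It assumes the usual definition of TCP as the softmax probability assigned to the human-scored class, a five-stage labelling, and simple concatenation of the three representations; the function names and the feature layout are hypothetical and are not taken from the authors' implementation. The confidence network itself is not shown.

```python
from typing import List, Sequence

# Assumed five-stage labelling; the class order is illustrative only.
STAGES = ["W", "N1", "N2", "N3", "REM"]


def one_hot_of_prediction(softmax: Sequence[float]) -> List[int]:
    """Binary code indexing the predicted (arg-max) class."""
    predicted = max(range(len(softmax)), key=lambda i: softmax[i])
    return [1 if i == predicted else 0 for i in range(len(softmax))]


def confidence_input(softmax: Sequence[float], hiddens: Sequence[float]) -> List[float]:
    """Concatenate softmax, predicted-class code, and hidden features.

    Concatenation is an assumption; the caption only lists the three inputs.
    """
    return list(softmax) + one_hot_of_prediction(softmax) + list(hiddens)


def tcp_target(softmax: Sequence[float], true_class: int) -> float:
    """Training target: probability assigned to the human-scored class."""
    return softmax[true_class]


def uncertainty(tcp: float) -> float:
    """Predictive uncertainty shown to the reviewer, 1 - TCP."""
    return 1.0 - tcp


if __name__ == "__main__":
    softmax = [0.10, 0.20, 0.60, 0.05, 0.05]  # made-up epoch output
    hiddens = [0.3, -1.2]                     # made-up hidden features
    print(confidence_input(softmax, hiddens))
    print(tcp_target(softmax, STAGES.index("N1")))
```

At inference time the human score is unavailable, so the confidence network's estimate stands in for `tcp_target`, and `uncertainty` is applied to that estimate.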

Table 6 Bootstrap Confidence Intervals for Difference of on-Subject Mean-Aggregated Confidence TCP-Scores of Aligning Vs Discordant Predictions
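As an illustration of the kind of analysis Table 6 summarizes, the sketch below computes, per subject, the difference between the mean TCP of aligning and discordant predictions, and then a percentile bootstrap confidence interval over subjects. The percentile variant, the resampling of subjects, the number of resamples, and all function names are assumptions made for illustration; the table title does not specify them, and the numbers in the usage lines are made up.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def subject_tcp_difference(
    tcp: Sequence[float], predicted: Sequence[str], reference: Sequence[str]
) -> float:
    """Mean TCP of aligning epochs minus mean TCP of discordant epochs.

    Raises statistics.StatisticsError if either group is empty.
    """
    aligning = [c for c, p, r in zip(tcp, predicted, reference) if p == r]
    discordant = [c for c, p, r in zip(tcp, predicted, reference) if p != r]
    return mean(aligning) - mean(discordant)


def bootstrap_ci(
    values: Sequence[float], n_boot: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-subject values."""
    rng = random.Random(seed)
    n = len(values)
    # Resample subjects with replacement and record the mean of each resample.
    boot_means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    per_subject = [0.41, 0.46, 0.38, 0.52, 0.44]  # illustrative values only
    print(bootstrap_ci(per_subject))
```

An interval lying entirely above zero would indicate that aligning predictions receive higher confidence than discordant ones across subjects.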

Table 7 Bootstrap Confidence Intervals for Correlation Between on-Subject Mean-Aggregated Confidence TCP-Scores and the Performance Metrics

Figure 4 Performance boost with physician’s review of epochs having confidence TCP-score lower than a given threshold.

Abbreviations: ID, in-domain; OOD, out-of-domain; K, Cohen’s kappa; Acc, accuracy; F1w, weighted F1-score.
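The threshold-based review summarized in Figure 4 can be simulated with a short sketch. It assumes that every epoch whose TCP falls below the threshold is rescored and that the reviewer's label equals the reference human score, which is an idealization; the function name and the toy data are hypothetical. Only accuracy is computed here; kappa and weighted F1 would be evaluated on the same corrected hypnogram.

```python
from typing import List, Sequence, Tuple


def review_below_threshold(
    predicted: Sequence[str],
    reference: Sequence[str],
    tcp: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (accuracy after review, fraction of epochs reviewed)."""
    # Flag every epoch whose confidence is below the threshold.
    reviewed: List[bool] = [c < threshold for c in tcp]
    # Assumption: the reviewer assigns the reference label to flagged epochs.
    corrected = [r if flag else p for p, r, flag in zip(predicted, reference, reviewed)]
    accuracy = sum(c == r for c, r in zip(corrected, reference)) / len(reference)
    review_fraction = sum(reviewed) / len(reference)
    return accuracy, review_fraction


if __name__ == "__main__":
    predicted = ["W", "N1", "N2", "N2", "REM"]
    reference = ["W", "N2", "N2", "N3", "REM"]
    tcp = [0.9, 0.3, 0.8, 0.6, 0.95]  # made-up confidence scores
    for threshold in (0.0, 0.5, 0.7):
        print(threshold, review_below_threshold(predicted, reference, tcp, threshold))
```

Sweeping the threshold from 0 to 1 traces the trade-off between review effort and agreement with human scoring.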

Table 8 Rescoring Amounts Needed to Achieve Desired Levels of Sleep-Scoring Performance

Figure 5 Review amounts (% of epochs exported) versus the % of all discordant predictions gathered.

Abbreviations: ID, in-domain; OOD, out-of-domain.
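The relationship plotted in Figure 5 can be sketched as follows, assuming epochs are exported for review in order of ascending TCP (least confident first). The function name and the toy data are hypothetical, and the sketch is not the authors' evaluation code.

```python
from typing import Sequence, Tuple


def discordant_captured(
    predicted: Sequence[str],
    reference: Sequence[str],
    tcp: Sequence[float],
    n_review: int,
) -> Tuple[float, float]:
    """Return (share of epochs exported, share of discordant epochs gathered)."""
    n = len(tcp)
    discordant = [p != r for p, r in zip(predicted, reference)]
    total_discordant = sum(discordant)
    if total_discordant == 0:
        raise ValueError("no discordant predictions to gather")
    # Export the n_review least confident epochs.
    order = sorted(range(n), key=lambda i: tcp[i])
    gathered = sum(discordant[i] for i in order[:n_review])
    return n_review / n, gathered / total_discordant


if __name__ == "__main__":
    predicted = ["W", "N1", "N2", "N2", "REM"]
    reference = ["W", "N2", "N2", "N3", "REM"]
    tcp = [0.9, 0.3, 0.8, 0.6, 0.95]  # made-up confidence scores
    for k in range(len(tcp) + 1):
        print(discordant_captured(predicted, reference, tcp, k))
```

A confidence score that ranks discordant epochs first gathers most of them after exporting only a small share of the recording.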