
Inter-rater reliability, intra-rater reliability and internal consistency of the Brisbane Evidence-Based Language Test

Pages 637-645 | Received 25 Jan 2019, Accepted 28 May 2020, Published online: 22 Jun 2020

Abstract

Purpose

To examine the inter-rater reliability, intra-rater reliability, internal consistency and practice effects associated with a new test, the Brisbane Evidence-Based Language Test.

Methods

Reliability estimates were obtained in a repeated-measures design through analysis of clinician video ratings of stroke participants completing the Brisbane Evidence-Based Language Test. Inter-rater reliability was determined by comparing 15 independent clinicians’ scores of 15 randomly selected videos. Intra-rater reliability was determined by comparing two clinicians’ scores of 35 videos when re-scored after a two-week interval.

Results

Intraclass correlation coefficient (ICC) analysis demonstrated almost perfect inter-rater reliability (0.995; 95% confidence interval: 0.990–0.998), intra-rater reliability (0.994; 95% confidence interval: 0.989–0.997) and internal consistency (Cronbach’s α = 0.940; 95% confidence interval: 0.920–1.0). Almost perfect correlations (0.998; 95% confidence interval: 0.995–0.999) between face-to-face and video ratings were obtained.

Conclusion

The Brisbane Evidence-Based Language Test demonstrates almost perfect inter-rater reliability, intra-rater reliability and internal consistency. High correlation coefficients and narrow confidence intervals demonstrated minimal practice effects in scoring and minimal influence of years of clinical experience on test scores. Almost perfect correlations between face-to-face and video scoring methods indicate these reliability estimates have direct application to everyday practice. The test is available from brisbanetest.org.

    Implications for Rehabilitation

  • The Brisbane Evidence-Based Language Test is a new measure for the assessment of acquired language disorders.

  • The Brisbane Evidence-Based Language Test demonstrated almost perfect inter-rater reliability, intra-rater reliability and internal consistency.

  • High reliability estimates and narrow confidence intervals indicated that test ratings vary minimally when administered by clinicians of different experience levels, or different levels of familiarity with the new measure.

  • The test is a reliable measure of language performance for use in clinical practice and research.

Introduction

Reliable identification of acquired language disorders (aphasia) is a core component of healthcare [Citation1]. Substantial functional disability caused by language impairment features prominently in healthcare decision-making [Citation2]. During the recovery phase, reliable monitoring of language abilities provides an accurate gauge of patient recovery [Citation2]. A deterioration in language performance may indicate a worsening medical condition, such as post-stroke haemorrhagic transformation [Citation3]; conversely, a detected improvement in language skills may indicate improved functioning and a response to therapy or intervention. Reliability in language measurement is pivotal in determining treatment effectiveness in research trials, gauging individual patient recovery, and informing critical clinical decisions such as the need for medical intervention or the need for referral and assistance post-discharge. Such factors rely heavily upon accurate, reliable assessment of language performance and a patient’s ability to communicate [Citation4].

The Brisbane Evidence-Based Language Test (Brisbane EBLT) (brisbanetest.org) is a new adult language test [Citation5]. The test is intended to provide an evidence-based, psychometrically robust alternative to informal or non-diagnostically validated language measures used in stroke care [Citation6,Citation7] and to comprehensive formalised tests which are reported to be too lengthy for use in some clinical contexts (e.g., the acute hospital ward). The Brisbane EBLT aims to provide a comprehensive, yet user-friendly and efficient new measure to assist in the identification of language deficits within a range of clinical contexts, including the hospital bedside [Citation5]. The 49-subtest Brisbane EBLT is the full version of the assessment, evaluating language across the severity spectrum in the following language domains: verbal expression (including repetition, automatic speech, spontaneous speech via picture description, and naming), auditory comprehension, actions/gesture, reading, and writing. Certain subtests require the use of two of each of the following everyday objects: cup, spoon, pen and knife. An additional “Perceptual” subtest examines abilities not requiring a verbal or written response (e.g., object-to-picture matching). Adapted scores and shorter test versions allow the test to adjust to individual patient need and varying clinical settings. This study is the second of two psychometric investigations of this new measure. Test development and diagnostic accuracy analysis examining the test’s ability to identify aphasia within acute stroke populations have been described elsewhere (brisbanetest.org) [Citation5]. The aim of this study is to report on the inter-rater reliability, intra-rater reliability, internal consistency and practice effects associated with this new measure.

Materials and methods

Study design

Reliability analysis was completed using a concurrent inter-rater and intra-rater repeated-measures study design. All clinician raters and stroke participants (or their authorised next of kin) provided written informed consent prior to study participation. This study received ethical approval from The University of Queensland Behavioural & Social Sciences Ethical Review Committee (2013000948) and the Metro South Human Research Ethics Committee (HREC/14/QPAH/138). This paper is written in accordance with the published Guidelines for Reporting Reliability and Agreement Studies (GRRAS) [Citation8]. The GRRAS guidelines are EQUATOR network (Enhancing the QUAlity and Transparency Of health Research) guidelines comprising widely accepted criteria for the rigorous reporting of sample selection, study design and statistical analysis in reliability research [Citation8].

Sample size justification

No pilot data for the inter-rater ICC existed; the expected ICC was therefore assumed to be 0.8 [Citation9,Citation10]. As the amount of between-rater variance could not be estimated, a number of simulations of R = 10 000 was used for the inter-rater sample size calculation. When R is large, the highest precision of estimation of the ICC is achieved when the number of participants approximates the number of raters. Therefore, based on the average 95% confidence interval (CI) of the ICC over 10 000 simulations (assumed ICC = 0.8), a total of 15 participants and 15 clinician raters were required to keep the width of the CI below 0.3 (lower bound 0.610; upper bound 0.898; width = 0.288) [Citation9,Citation10]. This equated to a total of 225 test ratings.

For the intra-rater sample size calculation, the criterion value of 0.8 was used to determine the number of consecutive measurements required per clinician rater [Citation9]. To obtain 80% power at the 5% significance level, two clinician raters were required to complete two ratings of 35 participants [Citation9] after a 2-week interval. This equated to 70 ratings per clinician and a total of 140 Brisbane EBLT ratings.
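
The simulation logic behind the inter-rater calculation can be sketched as follows. This is an illustrative reconstruction only, not the authors’ code or the exact procedure of Saito et al. [Citation10]: it assumes a simple one-way random-effects model with a true ICC of 0.8 and checks the average width of the exact F-based 95% CI across repeated simulated studies of 15 participants rated by 15 raters.

```python
# Illustrative sketch (assumed one-way random-effects model, not the authors' code):
# average 95% CI width of the ICC for a 15-participant x 15-rater design.
import numpy as np
from scipy.stats import f


def icc1_with_ci(y, alpha=0.05):
    """ICC(1,1) and its exact F-based CI from an n x k ratings matrix."""
    n, k = y.shape
    row_means = y.mean(axis=1)
    grand = y.mean()
    msb = k * ((row_means - grand) ** 2).sum() / (n - 1)          # between-subject mean square
    msw = ((y - row_means[:, None]) ** 2).sum() / (n * (k - 1))   # within-subject mean square
    icc = (msb - msw) / (msb + (k - 1) * msw)
    f0 = msb / msw
    fl = f0 / f.ppf(1 - alpha / 2, n - 1, n * (k - 1))
    fu = f0 * f.ppf(1 - alpha / 2, n * (k - 1), n - 1)
    return icc, (fl - 1) / (fl + k - 1), (fu - 1) / (fu + k - 1)


def average_ci_width(n_subjects=15, n_raters=15, true_icc=0.8, n_sim=10_000, seed=1):
    """Mean 95% CI width over n_sim simulated reliability studies."""
    rng = np.random.default_rng(seed)
    widths = []
    for _ in range(n_sim):
        subj = rng.normal(0.0, np.sqrt(true_icc), size=(n_subjects, 1))             # between-subject part
        err = rng.normal(0.0, np.sqrt(1 - true_icc), size=(n_subjects, n_raters))   # residual part
        _, lo, hi = icc1_with_ci(subj + err)
        widths.append(hi - lo)
    return float(np.mean(widths))


if __name__ == "__main__":
    # The paper reports a target CI width below 0.3 (0.288) for this design;
    # this simplified model gives only a rough check, not a replication.
    print(f"average 95% CI width: {average_ci_width():.3f}")
```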

Participants

Inter-rater reliability analysis required 15 stroke patient participants and 15 clinician raters. Intra-rater analysis required 35 stroke patient participants and two clinician raters, who completed their ratings twice. In total, 15 clinicians were recruited, as two of the 15 clinicians from the inter-rater reliability study (both with >5 years’ experience) went on to complete a second round of ratings for the intra-rater analysis.

Stroke participants

Reliability participants were acute stroke patients randomly sampled from a larger cross-sectional diagnostic accuracy study of 100 participants [Citation5]. Patients in this larger diagnostic study were consecutive stroke admissions from 21 January to 15 December 2015 at two large tertiary hospitals in Brisbane, Australia. All patients were screened within 2 days of hospital admission. Patients were eligible to participate if they were admitted for ischaemic or haemorrhagic stroke management, were deemed sufficiently medically and cognitively able to undergo language assessment (cognitive functioning was pragmatically assessed based on a patient’s ability to participate in, engage with and complete the required language tasks), and met the following criteria: aged >14 years; native-level English language ability in both written and spoken language; sustained level of consciousness for >10 min; absence of any precluding acute medical condition as per the treating medical team; and a confirmed stroke lesion site within the left frontal, parietal, temporal, occipital, limbic or insular lobes, internal capsule, thalamus (including thalamic nuclei), or basal ganglia (caudate nucleus, putamen, globus pallidus, substantia nigra, nucleus accumbens, and subthalamic nucleus). To optimise the test’s external validity, the presence of common post-stroke non-language but communication-related conditions (affecting vision, hearing, speaking, or writing) such as hemianopia, hemiparesis, dysarthria or apraxia of speech was not used as an exclusion criterion. For these patients, the presence of these co-occurring conditions was noted, and language test items affected by these conditions were recorded as missing data. Patients with subarachnoid haemorrhage or with lesions isolated to the right cerebral hemisphere, the right midbrain or subcortical regions, or below were not included [Citation5,Citation11].

All 100 recruited stroke patients were video recorded as they were administered the full 49-subtest Brisbane EBLT. Participants wore lapel microphones and were audio-recorded during the assessment to ensure all patient responses were accurately captured. The test was administered by one of two newly graduated qualified clinicians (speech pathologists), both of whom were familiar with the Brisbane EBLT’s administration guidelines (brisbanetest.org). A randomised sample of these 100 video recordings was selected for reliability analysis.

Participant video sampling method and strata size calculation

Videos used for reliability analysis were selected via stratified randomisation sampling [Citation12]. The Brisbane EBLT total score obtained from the original face-to-face clinician ratings provided a single rating which demonstrated no floor or ceiling effects, with scores ranging from 7 to 215 (out of a possible 0 to 258). This score was therefore used to provide a universal control for the covariate influence of language test performance [Citation13,Citation14]. Proportional allocation was used to ensure the selected sample in each stratum was representative of the larger 100-participant group [Citation15]. The same strata levels were applied to both the inter-rater and intra-rater reliability studies; however, separate simple randomisation was applied to each. Selected videos within each stratum were then randomised. Selected videos and audio recordings were checked for sound and video quality. If video positioning or poor recording quality impacted the ability to accurately rate patient performance, these videos were discarded and alternative videos were randomly selected from the sample via the same sampling method.
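
As an illustration of the proportional-allocation step, the following minimal sketch (not the study’s actual code) draws videos from score-based strata in proportion to each stratum’s share of the full 100-participant cohort; the score bands used here are placeholders only, with the actual stratification levels given in Table 1.

```python
# Illustrative sketch: proportional-allocation stratified random sampling of
# participant videos by Brisbane EBLT total score. Score bands are placeholders.
import random
from collections import defaultdict


def stratum(total_score):
    """Hypothetical score bands used purely for illustration (see Table 1)."""
    if total_score < 90:
        return "lower"
    if total_score < 170:
        return "middle"
    return "upper"


def proportional_stratified_sample(videos, n_sample, seed=42):
    """videos: list of (video_id, total_score); returns n_sample video ids,
    allocated to strata in proportion to stratum size in the full cohort."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for video_id, score in videos:
        strata[stratum(score)].append(video_id)
    selected = []
    for members in strata.values():
        n_from_stratum = round(n_sample * len(members) / len(videos))  # proportional allocation
        selected.extend(rng.sample(members, min(n_from_stratum, len(members))))
    rng.shuffle(selected)  # randomise presentation order
    return selected[:n_sample]


if __name__ == "__main__":
    cohort = [(f"video_{i:03d}", random.Random(i).randint(7, 215)) for i in range(100)]
    print(proportional_stratified_sample(cohort, n_sample=15))
```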

Clinician raters

A total of 15 clinicians were recruited to complete the reliability analysis. Clinician raters (speech pathologists) were recruited via purposeful sampling based on their level of clinical expertise (5 with <5 years’ experience, 5 with 5–10 years’ experience and 5 with >10 years’ experience). Raters were recruited via clinical and research contacts to include clinicians with experience within stroke and non-stroke clinical practice and research. All 15 clinicians participated in the inter-rater reliability analysis and two of these clinicians (both with >5 years’ experience) went on to participate in the intra-rater reliability analysis by completing each of their ratings twice (after a 2-week interval).

Procedure

Stroke participant Brisbane EBLT videos and audio recordings were collected and randomised prior to the commencement of the reliability analysis. Recruited clinician raters signed study consent forms and were given headphones, access to the participant video and audio recordings, paper copies of the Brisbane EBLT and a copy of the Brisbane EBLT Administration and Scoring Guidelines (brisbanetest.org) [Citation5]. A photocopy of each stroke participant’s written responses to the Brisbane EBLT writing subtests was provided to each clinician, as would be available in a usual clinical environment and because these responses were difficult to visualise fully and score via video alone.

Prior to commencing the video ratings, all recruited clinicians were unfamiliar with the Brisbane EBLT. Each clinician was provided with one practice video to watch and score in order to familiarise themselves with the new test. These scores were not included within the analysis. The same practice video was given to all raters. After completing the video, clinicians were given the opportunity to ask questions about the general study procedure (e.g., questions relating to the procedure of watching the videos or the steps in completing the study). Clinician raters were given only the Brisbane EBLT test form (which includes information on scoring specific test items) and the Administration and Scoring Guidelines form (which provides general scoring guidance) to assist their marking of patient responses. Clinicians were not provided with any specific Brisbane EBLT training or scoring guidance by the research staff prior to or during the reliability ratings (e.g., the research team did not provide any verbal suggestions on how to score items). The absence of any additional test-specific training (beyond that provided on the test forms) was intended to ensure the psychometric findings would replicate usual clinical practice, in which clinicians would not have any specific training prior to using the test and would have to rely on the Brisbane EBLT test form and Administration and Scoring Guidelines form to guide their marking of patient responses. To replicate the usual clinical environment, clinicians were asked to refrain from repeatedly re-watching sections of videos which were ambiguous for clinical reasons (i.e., an ambiguous patient response). If, however, reduced video or audio quality affected scoring, clinicians were instructed to re-watch that section as needed to obtain as accurate a rating as possible.

The 15 inter-rater reliability clinicians watched the same randomised 15 participant videos. The order of the videos was individually randomised for each clinician. Two clinicians went on to participate in the intra-rater reliability study and watched an additional 20 videos each, bringing their total to 35 videos. After a two-week interval, these two clinicians each re-watched the same 35 videos in the same randomised order. The two-week interval was selected to ensure clinicians could logistically complete the 70 videos within a 2-month period. As the schedule required each clinician to watch a minimum of 12 videos before returning and re-scoring the first participant video, any carry-over effect was considered minimal.

Reliability ratings were completed across four independent healthcare sites. No clinician rater knew or had met all other raters in the study. All clinicians completed their ratings independently and were blinded to the reference standard result, other clinicians’ ratings and their own prior ratings (where applicable). Clinicians were instructed to score all administered test items as per the scoring guidelines. If test items were mistakenly left blank or missed, the forms were returned and clinicians were asked to score these items (e.g., one clinician unintentionally left a whole section of the test unscored and was asked to score these items).

Statistical analysis

Reliability correlations were performed for the 45 language subtests, the four self-report questions, the five section totals and overall Brisbane EBLT score. While Brisbane EBLT test scores are discrete, the underlying construct being measured (language functioning) was considered a continuous variable. Data was examined for normality and homogeneity of variance to ensure it fulfilled the criteria for parametric tests. Ninety-five percent confidence intervals were calculated for each reliability coefficient.

Inter-rater analysis (the degree of agreement among different raters) at the Brisbane EBLT subtest level involved different reliability coefficients depending upon the number of possible participant responses. Binary questions and questions with up to three different possible answer types were analysed using Fleiss’s kappa [Citation16], as indicated when analysis involves only a few possible rating levels [Citation17]. An intraclass correlation coefficient (ICC; two-way random-effects model) was used for questions with multiple possible rating categories and for ordinal variables with >4 possible outcome responses [Citation17]. ICC scores range from 0 to 1 and represent the proportion of the variation in the ratings that is due to the performance of the participant under evaluation rather than to factors such as how the rater interprets the rubric. An ICC of 1 indicates perfect agreement whereas an ICC of 0 indicates no agreement [Citation17]. Mean inter-rater agreement (the probability, for a randomly selected participant, that two randomly selected raters would agree) was also calculated for each subtest, as was the percentage of complete agreement across all 15 raters [Citation17].
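
For concreteness, the following minimal sketch (an illustration, not the study’s analysis code) computes Fleiss’s kappa, the mean pairwise (inter-rater) agreement and the complete-agreement percentage for a single categorical subtest from a subjects-by-raters matrix of category codes.

```python
# Illustrative sketch: Fleiss's kappa, mean pairwise agreement and complete
# agreement for one categorical subtest (subjects x raters matrix of codes).
import numpy as np


def fleiss_kappa_and_agreement(ratings):
    """ratings: (n_subjects, n_raters) integer array of category codes."""
    ratings = np.asarray(ratings)
    n_subjects, n_raters = ratings.shape
    categories = np.unique(ratings)
    # counts[i, j] = number of raters assigning category j to subject i
    counts = np.stack([(ratings == c).sum(axis=1) for c in categories], axis=1)
    # per-subject proportion of agreeing rater pairs
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()                                    # mean pairwise agreement
    p_j = counts.sum(axis=0) / (n_subjects * n_raters)    # marginal category proportions
    p_e = (p_j ** 2).sum()                                # expected chance agreement
    kappa = (p_bar - p_e) / (1 - p_e)
    complete = (counts.max(axis=1) == n_raters).mean()    # all raters gave the same code
    return kappa, p_bar, complete


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy = rng.integers(0, 2, size=(15, 15))  # 15 participants x 15 raters, binary subtest
    kappa, mean_agreement, complete = fleiss_kappa_and_agreement(toy)
    print(f"kappa={kappa:.3f}, mean agreement={mean_agreement:.2%}, complete={complete:.2%}")
```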

Intra-rater reliability (the consistency of scoring by a single rater) for each Brisbane EBLT subtest was also examined using intraclass correlation coefficient (ICC) measures of agreement. An ICC(3,k) (mixed-effects model) was used to determine the consistency of clinician scoring over time. Binary questions (nominal variables, e.g., yes/no self-report questions) were analysed using a multilevel mixed-effects logistic regression for binomial responses. Ordinal variables (questions with >2 possible participant response types) were analysed using an ICC mixed-effects model. In addition, potential practice effects, manifested as changes in clinicians’ Brisbane EBLT scoring performance due to increased familiarity with the assessment, and potential fatigue effects were also examined [Citation18].
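
As a sketch of the ICC computations (illustrative only; the study’s analyses were run in Stata), the code below derives ICC(2,k) (two-way random effects, absolute agreement, average measures; one common choice for inter-rater totals) and ICC(3,k) (two-way mixed effects, consistency, average measures, as described above for the intra-rater analysis) from the two-way ANOVA decomposition of a subjects-by-raters score matrix, following the standard Shrout and Fleiss conventions.

```python
# Illustrative sketch: ICC(2,k) and ICC(3,k) from a two-way ANOVA decomposition
# of a subjects x raters (or subjects x occasions) matrix of total scores.
import numpy as np


def two_way_icc(y):
    """y: (n_subjects, k_raters) array of total test scores."""
    y = np.asarray(y, dtype=float)
    n, k = y.shape
    grand = y.mean()
    row_means = y.mean(axis=1)
    col_means = y.mean(axis=0)
    ss_total = ((y - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()    # between-subject
    ss_cols = n * ((col_means - grand) ** 2).sum()    # between-rater
    ss_err = ss_total - ss_rows - ss_cols             # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    icc2k = (msr - mse) / (msr + (msc - mse) / n)     # absolute agreement, average of k
    icc3k = (msr - mse) / msr                         # consistency, average of k
    return icc2k, icc3k


if __name__ == "__main__":
    rng = np.random.default_rng(3)
    ability = rng.uniform(7, 215, size=(35, 1))            # simulated true total scores
    scores = ability + rng.normal(0, 2, size=(35, 2))      # two rating occasions per participant
    print(two_way_icc(scores))
```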

Cronbach’s alpha was used to determine the internal consistency of the Brisbane EBLT. Values range between 0 and 1, with highly correlated test items resulting in a higher value of alpha [Citation19]. Finally, the mode of test administration was evaluated to assess for any potential difference between face-to-face scoring and scores obtained from clinicians’ ratings of participant videos. An ICC(2,1) (two-way random-effects model) was used to determine whether scores obtained across the two mediums were comparable. All statistical analyses were completed using StataIC 13 and correlation indices were interpreted according to the Landis and Koch [Citation20] guidelines for reliability coefficients: slight agreement (0.0–0.20), fair agreement (0.21–0.40), moderate agreement (0.41–0.60), substantial agreement (0.61–0.80), and almost perfect agreement (0.81–1.00).
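
A minimal sketch of Cronbach’s alpha, together with the Landis and Koch labels applied to reliability coefficients throughout the Results, is shown below; it is an illustration on toy data, not the study’s Stata code.

```python
# Illustrative sketch: Cronbach's alpha for a participants x subtests score
# matrix, plus the Landis & Koch verbal labels for reliability coefficients.
import numpy as np


def cronbach_alpha(item_scores):
    """item_scores: (n_participants, n_items) matrix of subtest scores."""
    x = np.asarray(item_scores, dtype=float)
    n_items = x.shape[1]
    item_var = x.var(axis=0, ddof=1).sum()     # sum of per-subtest variances
    total_var = x.sum(axis=1).var(ddof=1)      # variance of total scores
    return n_items / (n_items - 1) * (1 - item_var / total_var)


def landis_koch(coefficient):
    """Verbal label for a reliability coefficient (Landis & Koch, 1977)."""
    for cutoff, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                          (0.80, "substantial"), (1.00, "almost perfect")]:
        if coefficient <= cutoff:
            return label
    return "almost perfect"


if __name__ == "__main__":
    rng = np.random.default_rng(7)
    ability = rng.normal(size=(100, 1))
    subtests = ability + rng.normal(scale=0.5, size=(100, 10))  # 10 correlated toy subtests
    alpha = cronbach_alpha(subtests)
    print(f"alpha = {alpha:.3f} ({landis_koch(alpha)})")
```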

Results

Participants (stroke patient videos)

Fifteen inter-rater videos and 35 intra-rater participant videos were selected via randomised stratified sampling based on Brisbane EBLT language ability, as per the sample size requirements. Randomised participant videos were on average 48.09 min long and ranged from 31 to 71 min in length. Stratification levels and the number of allocated participants per stratum are listed in Table 1. Characteristics of the randomised inter-rater and intra-rater reliability stroke participants are described in Table 2.

Table 1. Stratification levels of participant sample by Brisbane EBLT score.

Table 2. Characteristics of the stroke participant sample.

Clinician raters

Fifteen clinicians participated in the study, of whom five had <5 years’ experience, five had between 5 and 10 years’ experience and five had >10 years’ clinical experience. All 15 clinicians were female, and all participated in the inter-rater analysis. Two clinicians (<5 years’ experience) participated in both the inter-rater and intra-rater video ratings. Recruited clinicians included 7 acute hospital clinicians, 5 PhD research students and 3 research staff. Characteristics of the clinician raters are described in Table 3.

Table 3. Characteristics of inter-rater and intra-rater clinician raters (n = 15).

Normality of the data

The Brisbane EBLT contains a total of 49 subtests which vary in level of task difficulty. Questions range from simple tasks (where most participants achieved a full score) to difficult tasks (where only a minority achieved a score). As such, data at the individual subtest level did not follow a normal distribution. While data transformations were attempted, these did not normalise the subtest distributions or the distributions of the residuals. However, non-normal residuals in multilevel modelling with large sample sizes have been shown to have little or no effect on parameter estimates [Citation21]. Clinically, subtests are not interpreted in isolation, and therefore overall test normality and homogeneity of variance were instead used to ensure the dataset fulfilled the criteria for parametric tests. The data consistently demonstrated almost perfect ICC correlations; consequently, even if the non-normality of the residual distribution spuriously inflated the correlation estimates, the data would still display very high correlations [Citation21].

Missing data

Brisbane EBLT scoring guidelines direct clinicians not to penalise patients for non-language-related deficits. For test items where a co-occurring condition (e.g., severe apraxia of speech, dysarthria, hemianopia or hemiparesis) resulted in an inability to determine language functioning, clinicians are directed to leave items blank. The decision as to whether test items were affected by severe co-occurring conditions, and therefore left blank, was based on the clinical judgement of each clinician rater. These blank scores were statistically treated as missing data and not included in the analysis. As less than 5% of the data was missing, this was considered to have a negligible effect on correlation estimates [Citation22].

Estimate of reliability including measures of statistical uncertainty

Inter-rater reliability analysis

Intraclass correlation coefficient (ICC) analysis demonstrated almost perfect agreement (0.995; 95%CI: 0.990–0.998) when comparing the 15 clinicians’ total Brisbane EBLT scores of the 15 acute stroke participants (total 225 test ratings) [Citation20]. Inter-rater reliability analysis was also completed at the Brisbane EBLT subtest level. Subtest correlations are listed in Table 4. Fleiss’s kappa was calculated for 30 Brisbane EBLT questions with <3 possible response types [Citation16] and demonstrated substantial agreement (0.7165), with an average mean percentage inter-rater agreement of 92% and complete agreement of 76%. Inter-rater ICC and complete and mean percentage agreement were calculated for subtests with >4 possible response types. Subtest ICC estimates ranged from substantial (0.704) to almost perfect (0.994) agreement. The average ICC correlation of 0.704 indicated substantial agreement across all relevant Brisbane EBLT subtests [Citation20].

Table 4. Inter-rater reliability per Brisbane EBLT subtest.

Intra-rater reliability analysis and practice effects

Intra-rater reliability involved the analysis of two clinicians’ scores of 35 video-recorded participants when re-scored after a 2-week interval. ICC analysis demonstrated almost perfect intra-rater agreement (0.994; 95% CI: 0.989–0.997) of the test ratings over time (total 140 test ratings) [Citation20]. Subtest-level intra-rater correlations were all almost perfect, ranging from 0.822 (95%CI: 0.721–0.892) to 1 (95%CI: NA) [Citation20]. ICC intra-rater subtest results are shown in Table 5.

Table 5. Intra-rater reliability per Brisbane EBLT subtest.

Clinician raters were unfamiliar with the Brisbane EBLT prior to completing test ratings. Intra-rater consistency estimates can therefore be interpreted in the context of practice effects in clinicians’ scoring, evidenced by changes in scoring style or method as a consequence of becoming familiar with the new test. The almost perfect consistency between each clinician’s first rating of a video and their re-rating of the same video 35 participants later demonstrated that there was limited clinician practice effect evident in Brisbane EBLT test scores.

Internal consistency

The Brisbane EBLT subtests demonstrated almost perfect internal consistency, with a Cronbach’s alpha of 0.940 (95%CI: 0.920–1.0) [Citation23]. A Cronbach’s alpha >0.80 is regarded as high and demonstrates that each subtest examines the same underlying construct while contributing additional information to the overall total score [Citation23].

Mode of delivery

To ensure scores obtained from video ratings are comparable to typical clinical face-to-face scoring methods, a comparison between scores obtained across these modalities was completed. An ICC(2,1) (two-way random-effects model) was used to compare clinician face-to-face scores obtained from the previous diagnostic accuracy study with the inter-rater video scores obtained in the present reliability analysis. Results indicated almost perfect agreement (ICC 0.998; 95%CI: 0.995–0.999) between test results obtained from these different scoring methods when scoring the same acute stroke participant [Citation20].

Discussion

The aim of this study was to examine the inter-rater reliability, intra-rater reliability and internal consistency of the Brisbane EBLT. Practice effects and the impact of the mode of delivery of clinician ratings (video versus face-to-face scoring methods) were also evaluated. Results demonstrated the Brisbane EBLT total score has almost perfect inter-rater (0.995; 95%CI: 0.990–0.998) and intra-rater reliability (0.994; 95%CI: 0.989–0.997) [Citation20]. Cronbach’s alpha estimate was also high (0.940; 95%CI: 0.920–1.0), indicating strong internal consistency [Citation23].

Clinicians with a range of experience levels participated in the study. The almost perfect inter-rater estimates and narrow confidence intervals found across all fifteen clinicians’ scores (irrespective of expertise level) indicate that prior experience has negligible impact on test scores. All raters were unfamiliar with the Brisbane EBLT prior to completing ratings. High intra-rater reliability estimates between initial and subsequent scores demonstrate that there were minimal practice effects associated with clinicians becoming familiar with the new assessment. These results have direct implications for clinical practice and research and indicate that experienced and newly qualified clinicians, as well as clinicians new to the assessment and those highly familiar with the Brisbane EBLT, will record similar scores when evaluating the same participant. Finally, comparison of clinicians’ results for the same stroke participant obtained from face-to-face scoring and those obtained from watching participant videos also demonstrated almost perfect correlations (ICC 0.998; 95%CI: 0.995–0.999) [Citation20], indicating that the video reliability results obtained in this study have application for everyday face-to-face clinical practice.

Comparison with other research

The Brisbane EBLT is a new measure, and as yet there are no studies with which to compare this study’s reliability estimates. Historically, however, a number of existing published language tests are used with high frequency among stroke clinicians [Citation6]. The Western Aphasia Battery (WAB) (and WAB-R) [Citation24], Comprehensive Aphasia Test (CAT) [Citation25], Measure of Cognitive-Linguistic Abilities (MCLA) [Citation26] and Boston Diagnostic Aphasia Examination (BDAE) [Citation27] are among the language measures most commonly used in stroke care [Citation6].

While the WAB-R [Citation24] and BDAE [Citation27] have no published reliability estimates with stroke populations, the WAB [Citation24], CAT [Citation25] and MCLA [Citation26] have undergone this reliability analysis. Historically, the WAB is one of the most frequently used language measures both within clinical practice and research [Citation6]. WAB inter-rater reliability was examined through the analysis of eight judges’ (five speech pathologists, two psychometricians and one neurologist) scores of 10 participants of “various types and severities” [Citation24, p.95] who had been videotaped while completing the WAB. Average intercorrelation of the judges’ ratings was found to be extremely high (≥0.98) [Citation24]. WAB intra-rater reliability analysis also reported significantly high correlations (≥0.79) when comparing three examiners’ scores of 10 participants when re-assessed “several months” apart [Citation24, p.94]. Similar inter-rater analysis was completed for the CAT [Citation25]. In this study, videotapes of four participants representing “a range of severity and aphasia types” [Citation25, p.111] were scored independently by five raters (two doctors, three speech pathologists). ICC analysis demonstrated excellent inter-rater agreement (0.722–1.00) for all subtest scores [Citation25]. Inter-rater reliability of the MCLA [Citation26] has also been analysed. In this study, scores of two different raters were compared for a subset of a normative (non-brain-damaged) population. Pearson correlation coefficients indicated high levels of reliability (0.90048–1.00) [Citation26].

Methodologically, however, these studies were completed prior to the publication of reliability reporting guidelines [Citation8]. While the WAB and CAT inter-rater studies [Citation24,Citation25] documented the raters’ professions, this information was absent for the WAB intra-rater study [Citation24] and for the MCLA [Citation26]. The method of statistical analysis was not reported for the WAB, nor was the time interval between the intra-rater ratings [Citation24]. Sampling methods for either the clinician raters or the study participants were not described for any study, nor were the demographic characteristics of the participant samples (e.g., age, gender, stroke type). While the CAT reported that inter-rater reliability ratings were completed independently [Citation25], this was not reported for the WAB [Citation24] or MCLA [Citation26]. Reliability estimates were also based on limited study samples [Citation28] of 20 test ratings (CAT inter-rater analysis) [Citation25], 80 ratings (WAB inter-rater analysis) [Citation24] and 30 ratings (WAB intra-rater analysis) [Citation24]. All studies lacked reporting of an a priori sample size calculation to ensure adequate statistical power [Citation29]. Incomplete adherence to quality and reporting criteria means the true reliability of these measures is difficult to ascertain. Compromised methodological quality, such as the absence of blinding of assessors and the use of small study sample sizes, may spuriously inflate reliability estimates [Citation29]. As such, true test reliability estimates could be substantially lower than those reported when the measures are applied within clinical or research populations which differ from those used within the initial study conditions. This outcome may have significant implications, not only for clinical practice, but also for research, where excess measurement error adversely influences the sample size needed, overall study cost, and the power to detect a true treatment effect [Citation30].

Strengths and weaknesses

A strength of this reliability study is in the methodology used and the adherence to the published Guidelines for Reporting Reliability and Agreement Studies (GRRAS) [Citation8]. A priori sample size calculations were completed for both the inter-rater and intra-rater reliability analyses and equated to 225 and 140 test ratings respectively. Clinician raters were purposefully sampled to include clinicians from multiple centres with varying backgrounds and expertise and were blinded to their own prior ratings, other clinicians’ ratings and the reference standard. In addition, the participant sample was a randomly selected heterogeneous cohort, stratified by language level to represent a range of abilities, including those with and without language impairment, as is typical of stroke populations. The high inter-rater reliability estimates found in the current study suggest that Brisbane EBLT test scores are not significantly altered by the location or experience level of clinicians. The generalisability of the results is strengthened by the varied clinical characteristics of the stroke participants, the diversity of clinician raters and the absence of any Brisbane EBLT scorer guidance or training, all of which reflect typical real-world everyday practice [Citation30].

Findings of this study need to be interpreted in the context of a number of factors. Firstly, given the absence of an existing published reference standard language test which assesses language across the severity spectrum, stratification of participants’ language ability was based on performance on the index measure, the Brisbane EBLT, the inherent reliability of which may have influenced the stratification process. Secondly, while clinicians were stratified by experience level, they were not randomly selected from the wider professional population. Finally, reliability estimates were obtained using ratings of videoed participant performance. While this method is considered one of the most realistic methods for collecting participant data for reliability studies and controls for variation in clinician scoring alone [Citation31], the mode of evaluation differs from that of a typical clinical setting. ICC scores obtained across these two rating methods demonstrated almost perfect correlation, a finding supported by previous research [Citation32]. The impact of this mode of delivery on clinician test ratings was therefore considered to be minimal.

Inter-rater reliability estimates obtained at the Brisbane EBLT subtest level demonstrated variable levels of reliability. These lower estimates, however, occurred due to the statistical limitations of the correlation estimates themselves and do not reflect poor reliability of the Brisbane EBLT language measure. Subtests analysed using the kappa statistic were influenced by the prevalence of ratings within subtest samples, resulting in low estimates despite near-perfect agreement [Citation33]. This is a well-documented limitation of this reliability coefficient [Citation33–35]. For these subtests, percentage agreement provides a more accurate estimate of the true agreement [Citation17,Citation34]. Conversely, lower percentage agreement for variables with multiple response options was more accurately reflected by ICC estimates [Citation17]. Clinically, reliability estimates at the subtest level are not typically examined in isolation, and the overall test score provides a more representative portrayal of the reliability of the measure when used in practice.

Conclusion

The Brisbane EBLT was found to demonstrate almost perfect reliability when scored by a variety of different clinicians across a range of stroke participants. Findings of this study suggest that Brisbane EBLT test ratings of the same patient will vary minimally when scored by different clinicians, or by the same clinician at different times. These findings have direct implications for clinical practice and indicate that when a change in test performance is detected, this likely reflects a true difference in patient language ability, as test scores are minimally influenced by measurement error. These study results support the use of the Brisbane EBLT as an evidence-based alternative to existing language measures and provide a psychometrically robust assessment of language performance for use within clinical practice and research. The Brisbane EBLT is available for download from brisbanetest.org.

Sources of funding and role of funders

This work was supported by the Australian Stroke Foundation; Equity Trustees Wealth Services Ltd.; Royal Brisbane and Women’s Hospital; and Royal Brisbane and Women’s Hospital Foundation. The funders had no role in design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Acknowledgements

The speech pathology department at the Royal Brisbane and Women’s Hospital made this study possible. The authors thank the stroke patients, clinicians and other study participants who contributed to this research. Full list of acknowledgements is available at brisbanetest.org.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • National Stroke Foundation. Clinical guidelines for stroke management. Melbourne (VIC); 2010.
  • Maas MB, Lev MJ, Ay H, et al. The prognosis for aphasia in stroke. J Stroke Cerebrovasc Dis. 2012;21:350–357.
  • Feteke R, Jeevan D, Marks SJ, et al. Hemorrhagic transformation of ischaemic stroke in patient treated with rivaroxaban. J Hematol. 2013;2:48–50.
  • Spreen O, Risser AH. Assessment of aphasia. New York (NY): Oxford University Press; 2003.
  • Rohde A, Doi S, Worrall L, et al. Development and diagnostic validation of the Brisbane Evidence-Based Language Test. Disabil Rehabil. 2020. Available from: https://mc.manuscriptcentral.com/dandr
  • Vogel AP, Maruff P, Morgan AT. Evaluation of communication assessment practices during the acute stages post stroke. J Eval Clin Pract. 2010;16:1183–1188.
  • Rohde A, Worrall L, Godecke E, et al. Diagnosis of aphasia in stroke populations: a systematic review of language tests. PLoS One. 2018;13:e0194143.
  • Kottner J, Audigé L, Brorson S, et al. Guidelines for reporting reliability and agreement studies (GRRAS) were proposed. J Clin Epidemiol. 2011;64:96–106.
  • Eliasziw M, Young SL, Woodbury MG, et al. Statistical methodology for the concurrent assessment of interrater and intrarater reliability: using goniometric measurements as an example. Phys Ther. 1994;74:777–788.
  • Saito Y, Sozu T, Hamada C, et al. Effective number of subjects and number of raters for inter-rater reliability studies. Stat Med. 2006;25:1547–1560.
  • Binder JR, Frost JA, Hammeke TA, et al. Human brain language areas identified by functional magnetic resonance imaging. J Neurosci. 1997;17:353–362.
  • Altman DG, Bland JM. Statistics notes: how to randomise. BMJ. 1999;319:703–704.
  • Alonzo TA, Pepe MS. Using a combination of reference tests to assess the accuracy of a new diagnostic test. Stat Med. 1999;18:2987–3003.
  • Kernan WN, Viscoli CM, Makuch RW, et al. Stratified randomization for clinical trials. J Clin Epidemiol. 1999;52:19–26.
  • Särndal CE, Swensson B, Wretman J. Model assisted survey sampling. New York (NY): Springer; 2003.
  • Fleiss JL. Reliability of measurement. In: Fleiss JL. editor. Design and analysis of clinical experiments. New York (NY): John Wiley & Sons; 1986. p.1–32.
  • Graham M, Milanowski A, Miller J. Measuring and promoting inter-rater agreement of teacher and principal performance ratings. Center for Educator Compensation Reform; 2012. p. 1–33. Available from: files.eric.ed.gov/fulltext/ED532068.pdf
  • Cohen JA, Fischer JS, Bolibrush DM, et al. Intrarater and interrater reliability of the MS functional composite outcome measure. Neurology. 2000;54:802–806.
  • Tavakol M, Dennick R. Making sense of Cronbach’s alpha. Int J Med Educ. 2011;2:53–55.
  • Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174.
  • Maas CJM, Hox JJ. The influence of violations of assumptions on multilevel parameter estimates and their standard errors. Comput Stat Data Anal. 2004;46:427–440.
  • Schafer JL. Multiple imputation: a primer. Stat Methods Med Res. 1999;8:3–15.
  • Bland JM, Altman DG. Cronbach’s alpha. BMJ. 1997;314:572.
  • Kertesz A. Western aphasia battery – revised. San Antonio (TX): Harcourt Assessment; 2007.
  • Howard D, Swinburn K, Porter G. Comprehensive aphasia test. New York (NY): Psychology Press; 2004.
  • Ellmo W, Graser J, Krchnavek B. Measure of cognitive-linguistic abilities (MCLA). Norcross (GA): The Speech Bin, Incorporated; 1995.
  • Goodglass H, Kaplan E, Barresi B. Boston diagnostic aphasia examination. 3rd ed. Baltimore (MD): Lippincott Williams & Wilkins; 2001.
  • Kline P. The handbook of psychological testing. 2nd ed. London (UK): Routledge; 2000.
  • Button KS, Ioannidis JPA, Mokrysz C, et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci. 2013;14:365–376.
  • Berg K, Wood-Dauphinee S, Williams JI. The balance scale: reliability assessment with elderly residents and patients with an acute stroke. Scand J Rehab Med. 1995;27:27–36.
  • Fawcett AL. Principles of assessment and outcome measurement for occupational therapists and physiotherapists: theory, skills and application. Chichester (UK): Wiley; 2007.
  • Theodoros D, Hill A, Russell T, et al. Assessing acquired language disorders in adults via the internet. Telemed J E Health. 2008;14:552–559.
  • Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. BMJ. 1992;304:1491–1494.
  • Hand PJ, Haisma JA, Kwan J, et al. Interobserver agreement for the bedside clinical assessment of suspected stroke. Stroke. 2006;37:776–780.
  • Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol. 1990;43:543–549.