
Examining 3-month test-retest reliability and reliable change using the Cambridge Neuropsychological Test Automated Battery


Abstract

The Cambridge Neuropsychological Test Automated Battery (CANTAB) is a battery of computerized neuropsychological tests commonly used in Europe in neurology and psychiatry studies, including clinical trials. The purpose of this study was to investigate test-retest reliability and to develop reliable change indices and regression-based change formulas for using the CANTAB in research and practice involving repeated measurement. A sample of 75 healthy adults completed nine CANTAB tests, assessing three domains (i.e., visual learning and memory, executive function, and visual attention) twice over a 3-month period. Wilcoxon signed-rank tests showed significant practice effects for 6 of 14 outcome measures with effect sizes ranging from negligible to medium (Hedges’ g: .15–.40; Cliff’s delta: .09–.39). The Spatial Working Memory test, Attention Switching Task, and Rapid Visual Processing test were the only tests with scores of adequate test-retest reliability. For all outcome measures, Pearson’s and Spearman’s correlation coefficients ranged from .39 to .79. The measurement error surrounding difference scores was large, thus requiring large changes in performance (i.e., 1–2 SDs) in order to interpret a change score as reliable. In the regression equations, test scores from initial testing significantly predicted retest scores for all outcome measures. Age was a significant predictor in several of the equations, while education was a significant predictor in only two of the equations. The adjusted R2 values ranged between .19 and .67. The present study provides results enabling clinicians to make probabilistic statements about change in cognitive functions based on CANTAB test performances.

Introduction

The Cambridge Neuropsychological Test Automated Battery (CANTAB) is a battery of computerized neuropsychological tests measuring multiple cognitive domains (Sahakian & Owen, Citation1992). It is commonly used in Europe in neurology (Ho et al., Citation2003; Williams-Gray, Foltynie, Brayne, Robbins, & Barker, Citation2007), psychiatry (Fried, Hirshfeld-Becker, Petty, Batchelder, & Biederman, Citation2015; Levaux et al., Citation2007), and neuropsychology research for studying diverse conditions, such as fetal alcohol spectrum disorders (Green et al., Citation2009), traumatic brain injury (TBI) (Sterr, Herron, Hayward, & Montaldi, Citation2006), Alzheimer’s disease (O’Connell et al., Citation2004), affective disorders (Sweeney, Kmiec, & Kupfer, Citation2000), and schizophrenia (Hutton et al., Citation2004). It has been used in clinical trials involving treatment for depression (Falconer, Cleland, Fielding, & Reid, Citation2010), schizophrenia (Turner et al., Citation2004), and obsessive-compulsive disorder (Nielen & Den Boer, Citation2003). The CANTAB is being used in CENTER-TBI (Maas et al., Citation2015), a large European project that aims to improve the care for patients with TBI. CENTER-TBI is part of a larger global initiative called the International Initiative for Traumatic Brain Injury Research (InTBIR) with projects currently ongoing in Europe, the United States, and Canada.

The reliability of the CANTAB tests has not been thoroughly examined. Adequate reliability is a fundamental requirement for any test used in neuropsychology, regardless of its purpose (Crawford & Garthwaite, Citation2012). In classical test theory, reliability coefficients indicate the degree to which a test is free from measurement error, and consequently the confidence that clinicians place in test scores. Test-retest reliability concerns the temporal stability of test scores and is of great importance for clinicians tracking change in cognitive functions over time. The Pearson’s correlation coefficient (Pearson’s r) is a value commonly used to estimate test-retest reliability.

When examining change in cognitive functions, clinicians must decide whether an individual’s test score is meaningfully different from a score obtained in a previous evaluation, and not a reflection of measurement error. Several methods are available for this purpose (for a review, see Duff, Citation2012). Two of the most commonly used approaches involve using the reliable change methodology and standardized regression-based formulas. The reliable change methodology was used extensively in clinical psychology (Jacobson, Roberts, Berns, & McGlinchey, Citation1999) prior to being applied to clinical neuropsychology (Chelune, Naugle, Luders, Sedlak, & Awad, Citation1993; Heaton et al., Citation2001; Iverson, Citation2001; Temkin, Heaton, Grant, & Dikmen, Citation1999) and sports neuropsychology (Barr & McCrea, Citation2001; Hinton-Bayre, Geffen, Geffen, McFarland, & Friis, Citation1999; Iverson, Lovell, & Collins, Citation2003). This method involves the calculation of the Reliable Change Index (RCI), which indicates the probability that an observed difference between two test scores reflects measurement error. Because the traditional RCI-approach assumes no benefit of prior exposure to a test, a modification of the formula is recommended in the case of known practice effects (Chelune et al., Citation1993).

The standardized regression-based (SRB) approach involves using linear regression formulas to predict a retest score based on performance at initial testing (McSweeny, Naugle, Chelune, & Luders, Citation1993). This corrects for differential practice effects and regression toward the mean due to imperfect test reliability, as well as for variability in retest scores. Linear regression formulas can be extended to incorporate factors such as sample characteristics (e.g., age, gender, education) and testing schedule variables (e.g., test-retest interval) when predicting retest scores. Regression-based change formulas have been used to investigate change in conditions such as epilepsy, TBI, and Parkinson’s disease (Duff, Citation2012).

To our knowledge, only a few studies have explored test-retest reliability of the CANTAB, three in older adults (Cacciamani et al., Citation2018; Goncalves, Pinho, & Simoes, Citation2016; Lowe & Rabbitt, Citation1998) and one in children (Syvaoja et al., Citation2015). Methodological differences (e.g., sample characteristics, administered tests, and test-retest interval) are evident between these studies. Nonetheless, a common finding is weak to moderate test-retest reliability for the majority of outcome measures, and only one of the studies used methods to evaluate change (Goncalves et al., Citation2016). There is a need for studies that compute reliable change statistics to refine the interpretation of the CANTAB in clinical practice. Therefore, the aim of the present study was to investigate the test-retest reliability of nine commonly used CANTAB tests across a three-month interval.

Methods

Participants

The participants were recruited as community controls in a large prospective cohort study on mild traumatic brain injury (MTBI) conducted as a collaboration between St. Olavs Hospital, Trondheim University Hospital and the Norwegian University of Science and Technology. The participants were matched at the group level regarding sex, age, and education to a sample of patients with MTBI. For practical reasons, they were recruited among the hospital and university staff, as well as families and friends of staff and patients with MTBI. Inclusion criteria were ages 16–59 years. Exclusion criteria were (a) non-residency in Norway or non-fluency in the Norwegian language; (b) ongoing severe psychiatric disease requiring treatment (e.g., bipolar disorder, severe depression), severe somatic disease, or substance abuse potentially making follow-up difficult; (c) history of complicated mild, moderate, or severe TBI or other preexisting neurological conditions with visible brain pathology or known cognitive deficits; and (d) MTBI in the last three months. One participant was excluded at the first visit due to a severe psychiatric disorder and one was excluded due to an unexpected MRI finding. Out of 81 participants who were assessed at the first visit, 75 returned for the second assessment and completed all tests. Only subjects assessed twice were included in the data analysis. Those not included in the data analysis were demographically similar to the overall sample, and no systematic reason for their failure to return for follow-up testing was apparent. Participants were not familiar with the CANTAB tests. None of the participants were diagnosed with Attention-Deficit/Hyperactivity Disorder, learning disability or used psychotropic medication. The participants (60% men) had a mean age of 32.21 years (SD = 13.10) with a mean level of education of 13.97 years (SD = 2.44, range: 10 to 18).
The Regional Committees for Medical and Health Research Ethics (REC Central) approved this project and all participants gave informed consent.

Materials and procedures

All participants were assessed twice over a three-month period (M = 3.10 months, SD = 0.37, range: 1.92–4.32) with the same CANTAB tests, administered in the same order. This test-retest interval was used because this study is part of a larger observational cohort study investigating cognitive function following MTBI in adults, and testing three months after injury is a commonly used time point in MTBI research (Iverson, Karr, Gardner, Silverberg, & Terry, Citation2019). Well-trained research staff with a bachelor’s or master’s degree in clinical psychology or neuroscience administered the tests. All staff members were under supervision by a licensed clinical psychologist. Psychiatric disease was assessed with the Mini-International Neuropsychiatric Interview (Sheehan et al., Citation1998) administered by a clinical psychologist or medical doctor.

CANTAB

The CANTABeclipse™ version 5.0.0 was used (Cambridge Cognition, Citation2012). Fourteen outcome measures from nine tests were included in the assessment procedure. Three tests were assumed to measure visual learning and memory (Cambridge Cognition, Citation2012). The Paired Associates Learning (PAL) task presents participants with several white boxes that contain different patterns. Each pattern is subsequently revealed for one second and the participants must remember which box contains which pattern. The test was run in clinical mode and total errors adjusted for the number of trials was chosen as the outcome measure. A higher score is indicative of worse performance. The Pattern Recognition Memory (PRM) test presents participants with two different series of 12 patterns. Participants are then required to identify previously seen patterns among novel patterns immediately after the presentation (the first series) and after a 20-min delay (second series). The test was run in clinical mode and percent of correctly identified patterns for each trial was chosen as the outcome measure for each series. A higher score is indicative of better performance. The Spatial Recognition Memory (SRM) test presents the participants with a sequence of five white boxes appearing at various positions on the screen, and the participants must remember the screen placement for each of the boxes. The test was run in clinical mode and percent correct was chosen as the outcome measure. A higher score is indicative of better performance.

Four tests were assumed to measure executive function (Cambridge Cognition, Citation2012). In the Stockings of Cambridge (SOC) test, participants are shown two displays with three balls presented inside stockings, and the aim is to move the balls in the lower display such that it is identical to the arrangement of balls in the upper display. The test was run in clinical mode. The outcome measure was minimum number of possible moves, reflecting the sum of problems solved with the minimum number of possible moves. A higher score is indicative of better performance. In the Attention Switching Task (AST), participants are to determine the side or direction of an arrow on the screen. The arrow varies with respect to placement (right or left) and direction (right or left). The test was run in touch screen mode and three outcome measures were chosen. The first outcome measure, referred to as congruency cost, is the difference in mean response time in milliseconds on congruent (placement and direction are the same) and incongruent (placement and direction are not the same) trials. A positive score indicates that the participant is faster on congruent trials and a negative score indicates that the participant is faster on incongruent trials. The second outcome measure, switch cost, is the difference in mean response time in milliseconds on switch trials (where the current trial type differs from the previous trial type) versus non-switch trials (where the current trial type and the previous trial type are the same, i.e., direction-direction or side-side). A positive score indicates that the participant is faster on non-switch trials, and a negative score indicates that the participant is faster on switch trials. The third outcome measure is the percent of correct trials for both congruent and incongruent trials. A higher score (i.e., greater percent correct) is indicative of better performance. The Spatial Working Memory (SWM) test requires participants to search through boxes for a designated number of tokens.
A token is never hidden in the same box twice; and to avoid errors, participants must remember where tokens originally appeared. The test was run in clinical mode and two outcome measures were chosen. The first outcome measure, between errors, is defined as the number of times the participant revisits a box in which a token has previously been found. A higher score is indicative of worse performance. The second outcome measure quantifies the effectiveness of the participant’s strategy. This is a measure of the ability to follow a predetermined sequence beginning with a specific box and then to return to that box to start a new sequence once a blue token has been found. The minimum strategy score is 8 and the maximum is 56. A higher score is indicative of worse performance. The Spatial Span (SSP) test presents participants with multiple white boxes that change color one by one, and participants are asked to tap the boxes in the same order as they change color. The test was run in clinical mode and maximum span length (i.e., longest sequence) was chosen as the outcome measure. A higher score is indicative of better performance.

Two tests were assumed to measure visual attention (Cambridge Cognition, Citation2012). In the Rapid Visual Processing test (RVP), participants are presented numbers from 2 to 9 appearing inside a white box one at a time with a rate of 100 presentations per minute. The participants must press a button on a response box each time they see one of three target sequences (i.e., 2-4-6, 4-6-8, and 3-5-7). The test was run in clinical mode. A prime (A′) is a measure of the ability to detect the target sequence and is the relationship between the probability of identifying a target sequence and the probability of identifying a non-target sequence. It ranges from .00 to 1.00 and a higher score is indicative of better performance. In the Reaction Time (RTI) test, the participant is to respond as fast as possible when a yellow dot is presented inside a circle (simple reaction time) and in one of five white circles (five-choice reaction time). The test was run in clinical mode and response time in milliseconds for each condition was chosen as the outcome measure. A higher score is indicative of worse performance.

Statistical analysis

All statistical analyses were conducted in R (R Core Team, Citation2017) using base R and relevant packages (compute.es: Del Re, Citation2014; psych: Revelle, Citation2016; rsq: Zhang, Citation2018). Raw test scores were used for all analyses because CANTAB only provides normalized scores for a small subset of all available tests and outcome measures. All participants successfully completed all CANTAB tests.

Several of the outcome measures violated the normality assumption, with outliers present for most measures. Hence, differences in test scores between sessions were evaluated with the Wilcoxon signed-rank test. Effect sizes were calculated using an unbiased Cohen’s d (Hedges’ g: Hedges & Olkin, Citation1985) and Cliff’s delta (Cliff, Citation1996). For Hedges’ g, an effect size ≤.20 was considered negligible, .21–.49 small, .50–.79 medium, and ≥.80 large (Cohen, Citation1992). For Cliff’s delta, an effect size ≤.15 was considered negligible, .16–.33 small, .34–.47 medium, and >.47 large (Romano, Kromrey, Coraggio, & Skowronek, Citation2006). Test-retest reliability was calculated with both Pearson product-moment correlation coefficients (r) and Spearman’s rank correlation coefficients (ρ). The level for acceptable test-retest reliability was defined as ≥.75, in accordance with previously recommended reliability levels for the CANTAB (Lowe & Rabbitt, Citation1998). The standard error of measurement (SEM) for each session was calculated as SEM = SD√(1 − r), where SD is the standard deviation from the session and r is the test-retest Pearson’s product-moment correlation coefficient. RCIs were calculated based on the standard error of the difference (SEdiff), calculated according to Iverson (Citation2001) as SEdiff = √(SEM₁² + SEM₂²), where SEM₁ and SEM₂ are the SEMs from the first and second sessions, respectively. Each confidence interval (CI) was calculated by multiplying the SEdiff by a specific z-score (i.e., 80% CI: z = 1.28; 90% CI: z = 1.64). For all outcome measures, the mean practice effect [i.e., Mean Time 2 (T2) − Mean Time 1 (T1)] was added to the lower and upper bounds of the CI for the RCI (Chelune et al., Citation1993).
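
The practice-adjusted RCI computation reduces to a few lines of code. The study's analyses were run in R; the sketch below is a minimal Python illustration, and the input values (SDs, r, session means) are hypothetical placeholders rather than figures from Table 1:

```python
import math

def rci_interval(sd1, sd2, r, mean_t1, mean_t2, z=1.64):
    """Practice-adjusted reliable-change interval for a difference
    score (T2 - T1), following Chelune et al. (1993)."""
    sem1 = sd1 * math.sqrt(1 - r)            # SEM, session 1
    sem2 = sd2 * math.sqrt(1 - r)            # SEM, session 2
    se_diff = math.sqrt(sem1**2 + sem2**2)   # standard error of the difference
    practice = mean_t2 - mean_t1             # mean practice effect
    # A difference score falling outside this interval is interpreted as reliable change
    return (practice - z * se_diff, practice + z * se_diff)

# Hypothetical inputs for illustration only
lo, hi = rci_interval(sd1=10.0, sd2=11.0, r=0.75, mean_t1=50.0, mean_t2=52.0)
```

With z = 1.64 this yields a 90% CI; substituting z = 1.28 gives the 80% CI also reported in the study.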

Regression-based change formulas (SRBs) using multiple regression equations were developed, in which scores from the first session (T1) were placed into a linear regression equation with scores from the second session (T2) as the dependent variable and age, sex, and education as covariates. Nonsignificant predictors (p > .05) were removed with stepwise regression using backwards selection. Predictors were removed in the following order: sex, education, and age. Of note, for all models, the mean of the residuals was approximately zero and residual variance was constant. Variance inflation factors were low (<2) for all covariates in all models. The Durbin–Watson test did not show autocorrelation of residuals, and all covariates and residuals were uncorrelated. However, deviations from normality of the residuals, as well as outliers and influential cases, were seen in several models. Predicted T2 scores were subtracted from the obtained T2 scores and divided by the standard error of the estimate (SEE). The calculation of the SRB results in a z-score. A z-score of ±1.65 was chosen as the demarcation point for reliable change, indicating that 10% (i.e., 5% at each tail of the distribution) of change scores will fall beyond this cutoff.
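
Once a regression equation is in hand, the SRB change score is simply the obtained retest score minus the predicted one, divided by the SEE. A minimal Python sketch with entirely hypothetical regression weights (the study's actual coefficients are reported in Table 3):

```python
def srb_z(t1, t2, age, coef, see):
    """SRB change score: (obtained T2 - predicted T2) / SEE.
    `coef` holds hypothetical regression weights, not values from this study."""
    predicted = coef["const"] + coef["t1"] * t1 + coef["age"] * age
    return (t2 - predicted) / see

# Hypothetical example: an error score improving from 20 to 14 in a 30-year-old
z = srb_z(t1=20.0, t2=14.0, age=30,
          coef={"const": 2.0, "t1": 0.7, "age": 0.1}, see=4.0)
reliable = abs(z) > 1.65  # demarcation point used in the study (90%, two-tailed)
```

Here the observed improvement does not exceed the ±1.65 demarcation point, so it would not be classified as reliable change.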

In addition to the RCI and SRB approaches to determining reliable change, the natural distribution of change scores (T2 − T1) for determining decline or improvement on the CANTAB is presented in Table 4. Unlike the RCI and SRB methods, this approach makes no assumption about normality of the data, instead providing raw values of change scores that fell below or above a specific cumulative percentage of our sample.
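
This distribution-free approach amounts to taking raw percentile cutoffs over the observed change scores. A Python illustration with made-up change scores (the study's actual cutoffs are those in Table 4):

```python
import statistics

# Hypothetical change scores (T2 - T1); not data from the study sample
change = [-8, -5, -3, -2, -1, 0, 0, 1, 1, 2, 2, 3, 4, 5, 9]

# method="inclusive" interpolates linearly between order statistics
q = statistics.quantiles(change, n=20, method="inclusive")
cutoffs = {"p5": q[0], "p10": q[1], "p90": q[17], "p95": q[18]}
# A patient's observed change score is compared directly against these raw cutoffs:
# below p5/p10 suggests decline, above p90/p95 suggests improvement.
```

No normality assumption enters anywhere; the cutoffs are simply locations in the observed sample distribution.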

Results

Mean scores for the first and second sessions are provided for each outcome measure in Table 1. Statistically significant differences (α = .05) in test scores between sessions were seen for AST percent correct, RVP A′, SWM strategy, PAL total errors adjusted, SWM between errors, and PRM delayed recall. Improved performance from session 1 to session 2 was seen on all measures, with effect sizes ranging from negligible to medium. The largest practice effects were seen for AST percent correct (g = .40, delta = .39) and RVP A′ (g = .37, delta = .28). Pearson’s product-moment correlation coefficients above the cutoff level for acceptability of ≥.75 (Lowe & Rabbitt, Citation1998) were obtained only for SWM strategy, AST percent correct, and RVP A′. Only SWM strategy and SWM between errors had a Spearman’s rank correlation coefficient >.75.

Table 1. Test-retest data for the study sample.

Table 2 shows mean difference scores, SEMs for each session, SEdiff, and RCIs with and without adjustment for practice effects. Large changes in test scores were required for reliable change for all outcome measures, ranging from one SD of the T1 score for AST percent correct to nearly two SDs of the T1 score for SRM percent correct (see Table 2). Table 3 shows the results from the regression equations. The F, R2, SEE, unstandardized beta weights, and the constant for each outcome measure are provided in Table 3. All F-tests were significant (p < .001), indicating that the regression models provided a better fit than the intercept-only model. Age and education were significant predictors in only some of the models. Across CANTAB tests, the models accounted for between 19% and 67% of the variance (adjusted R2). Partial adjusted R2 values are provided for all significant predictors. Table 4 provides change scores at the 5th, 10th, 90th, and 95th percentiles of the natural distribution of change scores for our sample.

Table 2. Mean difference score and reliable change estimates for CANTAB outcome measures.

Table 3. Regression equations for CANTAB outcome measures.

Table 4. Interpreting change on the CANTAB based on the natural distribution of difference scores (Time 2–Time 1).

Discussion

This study presents three-month test-retest data, as well as reliable change indices and regression-based formulas for several outcome measures from the CANTAB, thereby extending the current literature and facilitating the use of the CANTAB in clinical practice. Practice effects were seen for several outcome measures, a finding consistent with existing literature across tests from different cognitive domains (Calamia, Markon, & Tranel, Citation2012). Acceptable test-retest correlations of r ≥ .75 (Lowe & Rabbitt, Citation1998) were obtained for only SWM between errors, AST percent correct, and RVP A′; and only SWM strategy and SWM between errors had Spearman’s correlation coefficients of ρ > .75. Thus, the findings are consistent with previous studies (Cacciamani et al., Citation2018; Goncalves et al., Citation2016; Lowe & Rabbitt, Citation1998; Syvaoja et al., Citation2015), demonstrating low to medium reliability coefficients for the majority of CANTAB tests. Consistent with prior research studies in adults (Cacciamani et al., Citation2018; Goncalves et al., Citation2016; Lowe & Rabbitt, Citation1998), inadequate test-retest reliability was demonstrated for PAL total errors adjusted, PRM delayed recall, SRM percent correct, SSP span length, RTI simple and five-choice reaction time, and SOC minimum number of possible moves. Our finding of adequate test-retest reliability for SWM between errors and strategy, as well as RVP A′, is somewhat surprising, and is not consistent with prior research studies (Cacciamani et al., Citation2018; Goncalves et al., Citation2016). However, this inconsistency may be explained by the fact that these studies have included older adults, some with cognitive impairment, which is known to affect test-retest reliability (Calamia, Markon, & Tranel, Citation2013; Duff, Citation2012).

Low test-retest reliability is common in neuropsychology, and the reliability coefficients obtained for the CANTAB are similar to those associated with commonly used neuropsychological test batteries, such as the Delis-Kaplan Executive Function System (D-KEFS; Delis, Kaplan & Kramer, Citation2001), Wechsler Memory Scale, Third Edition (WMS-III; Wechsler, Citation1997), and Neuropsychological Assessment Battery (NAB; Stern & White, Citation2003). A common theme in psychometric research is that memory and executive functions are difficult to assess in a reliable manner (Calamia et al., Citation2013; Strauss, Sherman, & Spreen, Citation2006). Some authors have suggested that excellent tests of executive functions will inevitably have low temporal stability because these tests, by design, require novelty (Rabbitt, Lowe, & Shilling, Citation2001). Furthermore, it is reasonable to assume that successful performance on memory tests is, at least partially, dependent on executive functions, such as working memory and strategic approaches to learning, thus affecting the temporal stability of memory tests. In addition, the memory tests used in our study exposed participants to the same information twice, thus affecting test-retest reliability. Regardless, low reliability limits a test’s utility for diagnostic purposes and its usefulness for detecting change over time (Strauss et al., Citation2006).

We developed reliable change indices and regression-based change formulas for 14 outcome measures from 9 CANTAB tests. The measurement error surrounding difference scores indicated that relatively large changes in performance were needed to interpret a change as reliable, ranging from one SD of the T1 score for AST percent correct to nearly two SDs of the T1 score for SRM percent correct. Consistent with previous research on healthy adults (Attix et al., Citation2009; Duff et al., Citation2010; Duff et al., Citation2004; Duff et al., Citation2005; Sánchez-Benavides et al., Citation2016; Temkin et al., Citation1999), test scores from initial testing significantly predicted retest scores for all outcome measures. Furthermore, age was a significant predictor in many tests across all neuropsychological domains, including AST congruency cost and switch cost, PRM delayed recall, SRM percent correct, SSP span length, SWM between errors and strategy, RTI simple and five-choice reaction time, and SOC minimum number of possible moves. Education contributed significantly only in one test of visual memory (SRM percent correct) and one test of attention (RTI five-choice reaction time). These findings are inconsistent with the study by Goncalves et al. (Citation2016), which found the best fit when excluding age and education from the regression models. However, these differences may be explained by the small sample size utilized by Goncalves et al. (Citation2016), as studies with larger samples consistently have shown effects of both age and education on tests across multiple cognitive domains (Duff, Citation2012). Our finding of an adjusted R2 ranging between .19 and .67 indicated that a substantial proportion of the variance in retest scores remained unexplained by the regression models.
However, proportions of explained variance for the CANTAB tests were similar to findings from research on healthy adults using other neuropsychological tests measuring a broad range of cognitive domains (Attix et al., Citation2009; Duff et al., Citation2010; Duff et al., Citation2004; Duff et al., Citation2005; Sánchez-Benavides et al., Citation2016; Temkin et al., Citation1999). To illustrate the clinical use of reliable change indices, regression-based change formulas, and cutoffs from our observed change score distribution, we present a fictional case example in the Appendix.

Currently, consensus is lacking on the best method for evaluating reliable change (Hinton-Bayre, Citation2016). We chose to supplement the more traditional RCI approach with the SRB methodology because it takes into account several elements of variability (Chelune et al., Citation1993; Iverson, Citation2001). However, when comparing different methods, research has often produced similar results (Barr & McCrea, Citation2001; Heaton et al., Citation2001; Hinton-Bayre, Citation2012; Maassen, Bossema, & Brand, Citation2009). In addition, considering the non-normality of our data, we provided raw percentiles from our observed distributions, through which a clinician can compare where a change score falls relative to others within our sample. By using the data provided in this paper, clinicians have the opportunity to choose the methods most suited for their particular clinical situation, whether they wish to adjust for practice effects, consider age and education, or make normality assumptions in their determination of change.

Although our results have applications for use of the CANTAB, our study design does include limitations that researchers and clinicians should consider when translating our findings into their research designs or clinical approach. Reliability coefficients are influenced by many different factors such as the age and health of participants, as well as the length of the test-retest interval (Calamia, Markon, & Tranel, Citation2012). Since we applied a three-month test-retest interval and an age range from 16 to 59 years, our results may not be generalizable to assessments with longer or shorter test-retest intervals, or to patients and participants outside the age range of our sample (i.e., pediatric or geriatric populations). The three-month interval was chosen as the study was part of a larger study investigating cognitive functioning following MTBI in adults. A shorter test-retest interval may be more appropriate for some tests that may be re-administered multiple times over the course of recovery following an MTBI; and a longer test-retest interval may be more appropriate for other tests, if they are more often re-administered with longer intervals in clinical practice. However, the magnitude of the test-retest correlation has been shown to decrease with increasing time interval (Duff, Citation2012).

Furthermore, the sample size in our study is relatively small, which may affect the accuracy of our results. However, the sample size is comparable to other studies on the reliability of the CANTAB (Cacciamani et al., Citation2018; Goncalves et al., Citation2016; Syvaoja et al., Citation2015). We did not recruit participants directly from the community but used a convenience sampling approach to recruit hospital and university staff, as well as families and friends of staff and patients with MTBI. The mean education level in our sample was also fairly high (i.e., 14 years), which limits the application of our findings to participants with lower education levels. Thus, the generalizability of our results would be informed through replication with larger and more diverse samples of participants and through further studies on the CANTAB using different test-retest intervals. Another limitation in our study design is that we did not administer performance validity tests. However, none of the participants were involved in litigation and there were no other known external incentives.

Of note, our findings point to significant limitations in the reliability of CANTAB test scores, and we made judgements about the inadequacy of test-retest reliability based on a selected cutoff of ≥.75. Although we selected this cutoff, no universally accepted cutoff exists for defining adequate reliability. In the present study, we chose to describe reliabilities according to the labels used by Lowe and Rabbitt (Citation1998), but if we had chosen a lower cutoff for adequate reliability, such as .70 (Strauss et al., Citation2006), the outcome measures of AST switch cost, PAL total errors adjusted, and RTI five-choice reaction time would have been classified as acceptable. However, the calculations used to determine reliable change would not change.

A final limitation pertains to non-normality of our data. The calculation of RCIs and regression formulas for determining reliable change make certain assumptions concerning the properties of our data. We chose to approach the determination of change in three ways, including a simple description of cutoffs in our distribution that makes no assumption of normality. Many measures administered repeatedly in research or clinical practice present, by design, with non-normal distributions, because the scores either occur infrequently (e.g., errors) or have lower bound limits on performance (e.g., reaction time). As computerized tasks such as the CANTAB become more common in clinical practice, researchers may need to develop more sophisticated methods for interpreting individual change on tests with non-normal distributions that consider important aspects related to test performance (e.g., retest effects, age, education, etc.). Furthermore, neuropsychologists frequently evaluate patients on more than two time points, and it is unlikely that the results from this study can be used to investigate change between a second and a third time point. Future research should investigate change over multiple assessment sessions using the methods from this paper, as well as utilizing other statistical methods such as latent curve modeling (Duff, Citation2012).

In summary, the results of this study have implications for those who use the CANTAB in research and clinical practice. Practice effects were seen for several outcome measures, with AST percent correct and RVP A′ demonstrating the largest effect sizes. Acceptable levels of test-retest reliability were seen only for SWM between errors and strategy, AST percent correct, and RVP A′. Thus, the probable range of measurement error surrounding most test-retest difference scores is large for the CANTAB, meaning that large changes in performance are needed before a clinician or researcher can conclude with confidence that the observed change is not due to measurement error. The results from this paper allow neuropsychologists to consider these factors and make probabilistic statements about change using reliable change indices, standardized regression equations, and the distribution of change scores.

References

Appendix

Case example

To illustrate the clinical use of reliable change indices, regression-based change formulas, and cutoffs from our observed change score distribution, we present a fictional case example of a 25-year-old man with 15 years of education who had a traumatic brain injury of moderate severity. The patient is tested 3 and 6 months following injury with the RTI five-choice reaction time measure. Mean reaction time was 430 ms at 3-month testing and 370 ms at 6-month testing.

Reliable change index

This 60 ms decrease in reaction time is above the cutoff of 54 ms from the RCI 90% CI after adjusting for practice effects (see ), indicating that a reliable change has occurred. In the absence of a reference table, or if the clinician is interested in using a different confidence interval, the calculation can also be done manually with the following formula:

RCI = [(T2 − T1) − (M2 − M1)] / SEdiff

where the mean practice effect (M2 − M1 = −1.83) is subtracted from the difference score for the individual (T2 − T1 = −60), and the result is divided by the standard error of the difference (31.82). This calculation yields a z-value of −1.8, which is below the −1.65 demarcation point and falls at approximately the third percentile. If a more stringent criterion of z = ±1.96 (i.e., a 95% confidence interval) is used, the change is not interpreted as reliable.
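The manual calculation can be sketched in a few lines of Python. The numeric values are taken from the case example; the function itself is a generic illustration, not the authors' software:

```python
def rci(t1, t2, mean_practice_effect, se_diff):
    """Reliable change index: the observed change minus the group
    practice effect, divided by the standard error of the difference."""
    return ((t2 - t1) - mean_practice_effect) / se_diff

# Case example: 430 ms at 3 months, 370 ms at 6 months;
# M2 - M1 = -1.83 ms and SEdiff = 31.82 ms (from the reference table).
z = rci(t1=430, t2=370, mean_practice_effect=-1.83, se_diff=31.82)
print(round(z, 2))  # -> -1.83, beyond the -1.65 cutoff for the 90% CI
```

The same function applies to any outcome measure once its mean practice effect and SEdiff are taken from the reference table.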

Regression based change formula

Using this approach, the first step is to calculate the predicted retest score:

T′2 = b1T1 + b2Age + b3Education + c

where T′2 is the predicted retest score, b1 is the regression slope for initial testing, T1 is the score from initial testing, and c is the regression intercept. As age and education were significant predictors in the model for the RTI five-choice reaction time (see ), the regression slopes for age (b2) and education in years (b3) are included in the equation.

Using the information provided in the example (observed T1 and T2 test scores, age, and education) in combination with the data in (regression slopes and the intercept), the predicted retest score for RTI five-choice reaction time would be

T′2 = .56 × 430 + 1.32 × 25 − 3.00 × 15 + 137 = 366

The predicted retest score is then tested as follows:

RCI_SRB = (T2 − T′2) / SEE

where SEE is the standard error of the estimate of the regression equation. The resulting value [(370 − 366)/24.23 = 0.17] is then compared with a normal distribution table, with ±1.64 used as the cutoff for defining reliable change. The value falls within the ±1.64 interval, indicating that the 60 ms decrease in mean five-choice reaction time does not reflect a reliable change.
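Both steps of the regression-based approach can be sketched in Python. The coefficients and SEE below are the worked example's values for RTI five-choice reaction time; the function names are illustrative, not the authors' code:

```python
def predicted_retest(t1, age, education, b1, b2, b3, intercept):
    """Predicted retest score T'2 from the regression equation."""
    return b1 * t1 + b2 * age + b3 * education + intercept

def rci_srb(t2, t2_pred, see):
    """Standardized regression-based RCI: the residual in SEE units."""
    return (t2 - t2_pred) / see

# Worked example: b1 = .56, b2 = 1.32 (age), b3 = -3.00 (education), c = 137
t2_pred = predicted_retest(t1=430, age=25, education=15,
                           b1=0.56, b2=1.32, b3=-3.00, intercept=137)
z = rci_srb(t2=370, t2_pred=t2_pred, see=24.23)
print(round(t2_pred), round(z, 2))  # -> 366 0.17, within +/-1.64
```

For outcome measures where age or education was not a significant predictor, the corresponding slope is simply omitted (equivalently, set to zero).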

Absolute differences based on the distribution of change scores

A final method for evaluating change would use cutoffs from the distribution of change scores presented in . One can see that the 60 ms decrease in reaction time is below the fifth percentile, indicating that the improvement is unlikely to have occurred by chance, because fewer than 5% of our observed sample obtained such a change score.
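This distribution-based check reduces to locating the observed change within a normative change-score sample. A minimal sketch follows; the normative values below are hypothetical stand-ins, since the real cutoffs come from the study's published distribution table:

```python
def percentile_rank(change, normative_changes):
    """Percentage of normative change scores at or below the observed change."""
    n = len(normative_changes)
    return 100.0 * sum(c <= change for c in normative_changes) / n

# Hypothetical normative sample of 20 retest difference scores (ms),
# for illustration only; use the study's distribution table in practice.
normative = [-45, -38, -30, -25, -20, -15, -12, -8, -5, -2,
             0, 3, 6, 10, 14, 18, 22, 28, 35, 50]
rank = percentile_rank(-60, normative)  # the case example's 60 ms decrease
print(rank)  # -> 0.0: below the 5th percentile of this sample
```

Because this approach ranks the observed change directly against the empirical distribution, it makes no normality assumption, which is its advantage for skewed scores such as errors or reaction times.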