CLINICAL ISSUES

Methodological considerations when establishing reliable and valid normative data: Canadian Longitudinal Study on Aging (CLSA) neuropsychological battery

Pages 2168-2187 | Received 19 Jan 2021, Accepted 23 Jun 2021, Published online: 02 Sep 2021

Abstract

Objective: Creation of normative data with regression corrections for demographic covariates reduces the risk of small cell sizes compared with traditional normative approaches. We explored whether the method of correcting for demographic covariates (e.g., full regression models versus hybrid models combining stratification and regression) and the choice of covariates (i.e., correcting for age with or without corrections for sex and/or education) impacted the reliability and validity of normative data. Method: Measurement invariance for sex and education was explored in a brief telephone-administered cognitive battery from the Canadian Longitudinal Study on Aging (CLSA; after excluding persons with neurological conditions, N = 12,350 responded in English and N = 1,760 in French). Results: Measurement invariance was supported in hybrid normative models in which different age-based regression models were created for groups based on sex and education level. Measurement invariance was not supported in full regression models in which age, sex, and education were simultaneous predictors. Evidence for reliability was demonstrated by precision, defined as the 95% inter-percentile range of the 5th percentile. Precision was higher for full regression models than for hybrid models, but the differences in precision were negligible in the larger English-speaking sample. Conclusions: We present normative data for a remotely administered brief neuropsychological battery that best mitigate measurement bias and are precise. In the smaller French-speaking sample, only one model reduced measurement bias, and its estimates were less precise, underscoring the need for large sample sizes when creating normative data. The resulting normative data are appended in a syntax file.

Creation of comparison standards (sometimes referred to as 'normative data') is fundamental to the interpretation of neuropsychological test scores. Comparisons with normative data are necessary for determining whether a person's performance is below the range of healthy cognitive performance (i.e., impaired). In particular, exploration of cognition in an aging population without adjustment for variables associated with healthy aging will produce misleading results due to measurement bias; specifically, advanced age results in misclassifications of cognitive impairment when age corrections are not applied (O'Connell et al., 2017). In the Canadian Longitudinal Study on Aging (CLSA), several modifications to standard neuropsychological administration were necessary: a brief version of the battery was adapted for telephone administration, and the memory measure was modified to a single exposure trial with a 5-minute delay to accommodate time limitations and the mode of data collection (Tuokko et al., 2017). Consequently, new normative data were required to facilitate interpretation of cognitive status for participants in the CLSA. Methods to compare different normative approaches are lacking, and normative data are frequently presented without any support for their measurement properties (e.g., Heaton et al., 1999; Roelofs et al., 2013; Van der Elst et al., 2006, 2011, 2012). Regression-based covariate correction and stratification-based covariate correction for normative models (Dowling et al., 2010) both have evidence suggesting they can reduce measurement bias, but we are unaware of any manuscript that compares mitigation of measurement bias across different normative models within the same dataset. In the current paper we explore different models for the creation of normative data and present the evidence for validity, defined as reduction of the measurement bias associated with the demographic covariates of age, education, and sex. In addition, for each normative model we explore the evidence for reliability, defined as the amount of error associated with the estimated percentile rank that results from using that normative model as a comparison standard.

Core to the creation of all normative comparison standards is the selection of persons whose cognitive status is likely within normal limits. For large epidemiological studies such as the CLSA, this necessitates the exclusion of persons who report neurological conditions that could impact cognition. A further critical consideration in creating normative comparison standards from the resulting healthy sample is whether and how to correct for sources of measurement bias, that is, covariates associated with healthy aging that are known to impact cognition. Specifically, lower cognitive performance can be due to advancing age or fewer years of education, and these factors should be accounted for before interpreting a person's cognitive status.

A second consideration in the creation of normative comparison standards is the choice of method to make the necessary adjustments for covariates, such as age, to remove measurement bias. In the CLSA sample, O’Connell et al. (2017) found that age and medical conditions were associated with cognitive performance, but only age was associated with an increased likelihood of misclassification of cognitive impairment; consequently, we created normative data corrected for age but not for medical conditions. Typically, corrections for possible measurement bias in cognition occur for age, sex, and years of formal education.

Measurement bias has been demonstrated for neuropsychological data: years of formal education (Brewster et al., 2014); sex (Maitland et al., 2000); and sex, age, and education together (Bertola et al., 2020; Rawlings et al., 2016) can bias measurement of neuropsychological constructs. Numerous statistical methods explore measurement bias, but these are underused in neuropsychology (Pedraza & Mungas, 2008). Measurement invariance is one method for determining measurement bias based on subgroups, and it was an ideal approach for this study given the battery approach to assessment of the latent constructs of memory and executive function. Four levels of invariance have been defined (Meredith & Teresi, 2006; Putnick & Bornstein, 2016; Vandenberg & Lance, 2000; Wu et al., 2007); from weakest to strongest these are: configural (the same indicators load on the same factors for the subgroups), weak/metric (configural plus equality of factor loadings across subgroups), strong/scalar (weak/metric plus equality of indicator intercepts across subgroups), and strict (strong/scalar plus equality of residuals across subgroups). While many researchers report weak invariance as sufficient evidence of measurement invariance (for a review of measurement invariance in psychology, see Vandenberg & Lance, 2000), others argue that strong or strict invariance is necessary to ensure lack of measurement bias (Meredith & Teresi, 2006; Wu et al., 2007). Bias in normative data can be removed with covariate corrections, and measurement invariance has been used to demonstrate the removal of bias across variables such as age, education, and sex in normative data created using regression models and stratification (Dowling et al., 2010).

There are numerous approaches to removing measurement bias with covariate corrections while creating normative models. Traditional methods for creating normative comparison standards calculate the means and standard deviations of the cognitive test scores for groups stratified by age, years of formal education, and/or sex. In this traditional approach, the group means and standard deviations are used to calculate an individual's z-score relative to his/her group, to compare his/her performance and/or determine impairment. This traditional method based on stratified-sample z-scores has been criticized because z-score estimates based on normative samples containing fewer than 50 persons can overestimate how rare, or extreme, a person's score is relative to the normative standard (Crawford & Howell, 1998). Clinically, this could result in spurious interpretation of cognitive impairment. Although large normative samples might appear robust to this problem, when samples are stratified by sex, age group, and years of education, the resulting cell sizes used as the basis for the normative group can be quite small (Crawford & Howell, 1998). In a simulation study, Oosterhuis et al. (2016) suggested that an overall sample size of 10,000 was needed to create normative data with the traditional approach. With this overall sample size, Oosterhuis et al. found that stratifying based on age, education, or other demographic variables resulted in good normative data based on equal-sized normative groups of 1,250. However, in empirical studies, groups are rarely equal in size, nor this large. For example, even in the CLSA, where the normative sample (described in more detail later) exceeded 10,000, only 49 English-speaking women were between the ages of 45 and 55 and had a formal education level of less than high school.

Oosterhuis et al. (2016) found that for regression-based approaches, a non-stratified sample size of 1,000 was sufficient to create precise normative data. Due to known problems with small normative group sizes, even in what appear to be relatively large normative samples, regression-based approaches to normative data creation for all types of measures are becoming more common (e.g., Berrigan et al., 2014; Heaton et al., 1999; Roelofs et al., 2013; Van Breukelen & Vlaeyen, 2005; Van der Elst et al., 2006, 2011, 2012). Briefly, regression-based approaches to creating normative comparison standards involve two basic steps (Van Breukelen & Vlaeyen, 2005): estimating a best-fitting regression model by regressing the test scores onto the participants' covariates (e.g., age in years, years of education, sex); and using the best-fitting model to create a distribution of standardized residual scores (i.e., the difference between each observed score and the score predicted by the regression model, divided by the standard deviation of the residual scores). The final distribution of standardized residuals is then used as the comparison standard to assess a client's impairment status: the estimated best-fitting regression model is used to obtain the client's standardized residual score and his/her standing in the distribution in terms of percentile rank.
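To make these two steps concrete, the following minimal sketch in R (not the CLSA analysis code; the data are simulated and the variable names rey1, age, educ_yrs, and sex are illustrative) fits a full-sample regression and derives a hypothetical client's percentile rank from the standardized residuals:

```r
## Step 0: simulated normative data standing in for a real sample.
set.seed(42)
n <- 1000
norm_df <- data.frame(
  age      = runif(n, 45, 85),
  educ_yrs = sample(8:20, n, replace = TRUE),
  sex      = factor(sample(c("F", "M"), n, replace = TRUE))
)
norm_df$rey1 <- 10 - 0.05 * norm_df$age + 0.10 * norm_df$educ_yrs + rnorm(n)

## Step 1: regress the test scores onto the demographic covariates.
fit <- lm(rey1 ~ age + educ_yrs + sex, data = norm_df)

## Step 2: standardized residuals form the comparison distribution.
z_resid <- resid(fit) / sd(resid(fit))

## Scoring a client: standardized residual relative to the model's
## prediction, then the empirical percentile rank in the distribution.
client <- data.frame(age = 78, educ_yrs = 10,
                     sex = factor("F", levels = levels(norm_df$sex)))
client_z   <- (6 - predict(fit, client)) / sd(resid(fit))  # observed score: 6
percentile <- mean(z_resid < client_z) * 100
```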

Regardless of how they are created, all normative models have accompanying error (Crawford et al., 2009). Recently, methods have been proposed to develop confidence intervals (CIs) for percentile ranks derived from normative comparison standards as indicators of the amount of uncertainty in the person's standing (i.e., percentile rank) in the distribution of scores in the normative sample (Crawford et al., 2011; Oosterhuis et al., 2017; Voncken et al., 2019). There is error in using a normative sample to estimate a person's ranking in the normative population, and this error or uncertainty in the estimate is reflected in the width of the CI (Crawford et al., 2011, p. 13).

Oosterhuis et al. (2016) found that precision differed more at scores corresponding to extreme percentiles, such as the 2nd or 1st, but these are rarely used clinically. For the current paper we chose the 5th percentile because more lenient cut-off scores for impairment, such as the 8th or even the 16th percentile, lead to spurious conclusions about cognitive impairment when used across a battery of tests (e.g., Crawford et al., 2007). The 5th percentile, we argue, balances the risk of spurious impairment against the clinical utility of interpreting individual test scores. We created separate comparison standards for neuropsychological tests administered in French and in English, a necessary step until evidence of measurement invariance across languages is established, because translation of tests across languages can impact measurement of cognition (e.g., Pedraza & Mungas, 2008).

For each cognitive test we created regression-based normative models and hybrid models combining regression and stratification, taking into account the categorical sex variable, the skewness of the categorical education variable, and deviations from linearity of test scores with age in some education subgroups. Multi-group confirmatory factor analysis (MG-CFA) was used to compare the different normative models on how well they removed bias for the categorical variables of sex and education level. Then we used precision (based on bootstrapped variability in the score associated with the 5th percentile) to compare the estimated regression models with and without stratification for demographic covariates. We were also able to compare normative models based on regression models that included only stronger predictors. We postulated that including all three demographic covariates, including potentially extraneous covariates, could result in a 'specification error,' increasing the confidence intervals of the estimated regression coefficients (e.g., Pedhazur, 1997) and impacting the precision of the normative data. To identify potentially extraneous covariates, we retained only those demographic variables with a bivariate association with the test scores of approximately .20 or greater, as proposed by Crawford et al. (2011).
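A minimal sketch of this screening rule, continuing the simulated data above (the .20 threshold and the choice of point-biserial versus Spearman coefficients follow the text; the variable names remain illustrative):

```r
## Retain sex and/or education as predictors only if the bivariate
## association with the test score is approximately .20 or greater.
r_sex  <- abs(cor(norm_df$rey1, as.numeric(norm_df$sex)))               # point-biserial
r_educ <- abs(cor(norm_df$rey1, norm_df$educ_yrs, method = "spearman")) # Spearman

keep <- c("age",
          if (r_sex  >= .20) "sex",
          if (r_educ >= .20) "educ_yrs")
fit_strong <- lm(reformulate(keep, response = "rey1"), data = norm_df)
```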

Method

Participants

The CLSA is a large, national, long-term study that will follow more than 50,000 men and women, aged 45 to 85 at enrolment, for at least 20 years. At baseline, 21,241 CLSA participants were randomly selected from the 10 Canadian provinces and provided questionnaire data through telephone interviews (i.e., the Tracking cohort). Another 30,097 participants were randomly selected from areas extending 25-50 km from each of 11 Data Collection Sites located across Canada, and provided data through an in-home interview and a visit to a Data Collection Site (i.e., the Comprehensive cohort). This paper includes data from the Tracking (telephone interview) cohort.

To ensure standardization, cognitive testing was completed from central locations through computer-assisted telephone interviews conducted by highly trained interviewers. The word list for the memory test was delivered by recording, and responses to all cognitive tests were audio recorded, with scoring completed centrally to ensure standardization. Sociodemographic variables such as age, sex, language of test administration, and educational attainment (collected as a categorical variable) were obtained from the full set of variables collected (Raina et al., 2009, 2019). From the initial sample of 21,241, 1,826 were excluded due to having one or more of the following self-reported neurological conditions: transient ischemic attack (n = 748, 3.50% of the sample); stroke (n = 390, 1.80%); memory problems (n = 449, 2.10%); dementia or Alzheimer's disease (n = 43, 0.20%); parkinsonism or Parkinson's disease (n = 78, 0.40%); multiple sclerosis (n = 141, 0.70%); epilepsy (n = 166, 0.80%); cancers of the nervous system, reported by those clarifying a response of 'other' cancers (n = 21, 0.10%); and self-reported concussion injuries in the past 12 months (n = 49, 0.20%). Of the remaining 19,415 participants, 4,549 were excluded due to missing data in one or more of the following categories: missing years of formal education (n = 84, 0.40%); no consent to record responses to the neuropsychological battery (n = 140, 0.80%); memory test missing or could not be scored from the recording (n = 1,767, 8.30%); generative fluency test missing or could not be scored from the recording (n = 799, 3.80%), or administered incorrectly due to prompting (n = 491, 2.30%); complex attention test missing or could not be scored from the recording (n = 2,406, 11.30%), or given an impossible score of 0 (n = 748, 4.00%). Of the remaining 14,855 participants, 745 (5.00%) were excluded because they switched back and forth between English and French either across the test battery or within tests (i.e., the test start language and response language were incongruent). The final normative sample included 14,110 participants. Of this sample, 12,350 completed the tests in English and 1,760 completed the tests in French. Table 1 includes the demographic information for each of these normative samples.

Table 1. Descriptive statistics for the English- and French-speaking participants without self-reported neurological conditions and with complete neuropsychological data.

Measures

For each of the aforementioned chronic conditions (i.e., a transient ischemic attack, stroke, memory problems, dementia or Alzheimer’s disease, Parkinson’s disease, multiple sclerosis, epilepsy, cancer) participants were asked “Has a doctor ever told you that you have ___?” Participants were asked to report only conditions that lasted, or were expected to last, at least 6 months and were diagnosed by a health professional. Only participants who responded “no” to all conditions were selected for the present analyses (i.e., participants who did not know, did not respond, or refused to respond were excluded, as were participants who responded “yes”).

During the development of the CLSA, cognitive measures were identified through literature reviews and chosen based on their psychometric properties (e.g., sensitivity, specificity, reliability, and responsiveness), their availability in French and English, their relevance to those aged 45-85, and time and cost of administration (Tuokko et al., 2017). Participants in the Tracking cohort completed three neuropsychological tests administered via telephone: a modified version of the Rey Auditory Verbal Learning Test (REY; Rey, 1964), the Mental Alternation Test (MAT; Teng, 1994), and the Animal Fluency Test (AFT; Goodglass & Kaplan, 1983). The original RAVLT asks participants to remember a list of 15 words (Rey, 1964), but the CLSA version, called the REY, was modified in two ways: 1) it was administered over the telephone, facilitated by a recording to ensure standardized timing of the word list; and 2) it consisted of a single exposure trial with immediate recall (REY I) and free recall after a 5-minute delay (REY II). In the AFT, participants are asked to name as many animals as possible in 60 seconds. Two scoring algorithms were used to score the AFT: strict (AFT 1), where points are allotted only for taxonomically distinct animals that differ at the level of species (Tuokko et al., 2017), and lenient (AFT 2), where scoring was consistent with the scoring rules for semantic fluency in the Delis-Kaplan Executive Function System (Delis et al., 2001), in which credit can be given for each distinct animal named. The MAT (Teng, 1994) is a verbal analog of the Trail Making Test, in which participants must verbally alternate between letters and numbers in ascending order. The score is the number of consecutive correct responses in 30 seconds.

Statistical analyses for creating the normative standards

Regression model estimation for individual cognitive tests

For each cognitive test, a number of regression models were estimated. Model estimation was conducted within stratified subgroups (stratified by sex, education level, and/or age grouped into decades) as well as in the full sample with the stratification variables included as covariates in the regression model. The variety of regression models we estimated was dictated in part by the failure of the CLSA data to meet the assumptions for regression analyses: homoscedasticity (particularly relevant when looking at the tail of the distribution at the estimate of the 5th percentile of the test scores) in both the English- and French-speaking samples, and linearity between the test scores and age in the English-speaking sample. Also, education level was highly negatively skewed (70% or more were university graduates), so we stratified by education level as well as using education level as a "continuous" predictor in the regression models.

In the English-speaking sample, the relationship of some cognitive test scores (REY I, REY II, and the MAT) with age, across the range of 45 to 85 years, was nonlinear (particularly in the highest education group). Test scores decreased with age fairly linearly from 45 to about 75 years of age, but then dropped at an accelerated rate from 76 to 85 years of age. In order to capture this nonlinearity in the resulting normative standards, we created age subgroups so that the relationship with age could be estimated for the 75- to 85-year-old participants separately from the others (we refer to this model as "piecewise linear").
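A minimal sketch of this piecewise handling, using the simulated data from the earlier sketch (a real analysis would do this within the sex and education strata described below):

```r
## Estimate the age slope separately for the oldest participants,
## whose scores drop at an accelerated rate.
fit_young <- lm(rey1 ~ age, data = subset(norm_df, age < 75))
fit_old   <- lm(rey1 ~ age, data = subset(norm_df, age >= 75))
```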

For both the English- and French-speaking samples, the following six regression models were estimated for each cognitive test (a sketch of the first, hybrid model appears after the list):

  • regression models with age (in years) regressed onto the test scores, within groups stratified for sex and education group;

  • regression models with age (in years), with and without education level (four categories), regressed onto the test scores, within groups stratified by sex;

  • regression models with age (in years), with and without sex, regressed onto the test scores, within groups stratified by education level; and

  • regression models in non-stratified samples,

    • with age (in years), education, and sex as predictors,

    • with age (in years) and either sex or education if its bivariate association (r point biserial, rpb; or Spearman, rs, as appropriate) was .20 or greater with the cognitive test score (i.e., only “stronger” predictors), and

    • with age (in years) only.
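A minimal sketch of the first, hybrid model, continuing the simulated data (the four-category educ_level variable and the stratum labels are illustrative, not CLSA variable names):

```r
## Stratify by sex and education level, then fit an age-only
## regression within each stratum.
norm_df$educ_level <- cut(norm_df$educ_yrs,
                          breaks = c(0, 11, 12, 15, 20),
                          labels = c("<HS", "HS", "some PSE", "degree"))
strata_fits <- lapply(split(norm_df, list(norm_df$sex, norm_df$educ_level)),
                      function(d) lm(rey1 ~ age, data = d))

## A client's standardized residual then comes from his/her stratum's model.
m <- strata_fits[["F.<HS"]]
client_z <- (6 - predict(m, data.frame(age = 78))) / sd(resid(m))
```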

In the English-speaking sample, stratifying by sex and education group (to account for the skew in the distribution of education) revealed a nonlinear relationship with age within the resultant subgroups for the REY I, REY II, and MAT. Therefore, the English-speaking sample was further stratified into four 10-year age groups (i.e., 45-54, 55-64, 65-74, and 75+), and an additional model using piecewise linear regressions was estimated within each sex, education-level, and 10-year age group for these three cognitive tests. This stratification also resulted in smaller cell sizes, which allowed us to examine the effects of sample size on the precision of the estimates.

Model comparisons using measurement invariance

Standardized residuals from the various regression models were used as the basis for all precision comparisons for simplicity (i.e., standardized residuals are easily saved during regression analyses) and because any subsequent transformations (e.g., into test score units or scaled scores) were linear transformations with constants and would, therefore, impact the residuals equally.

Measurement invariance was explored using multi-group confirmatory factor analysis (MG-CFA). Theory (Tuokko et al., 2017) and the intercorrelation matrices supported a 2-factor solution for the telephone-administered CLSA battery, with a memory factor and an executive function factor; fit indices not sensitive to large samples suggested good-fitting models [English χ2 = 2.29, df = 6, CFI = 1.00, RMSEA = 0.01; French χ2 = 2436.90, df = 6, CFI = 1.00, RMSEA = 0.00]. MG-CFA involves testing models with increasing constraints for subgroups of participants. For each normative model in English and in French, MG-CFA was conducted once for subgroups based on sex and once for subgroups based on education using the R package lavaan (v. 0.6-5; Rosseel, 2012) in R (R Core Team, 2017); we overrode the marker-variable default in lavaan and used the reference-group method. In MG-CFA, nested models are compared, initially to the unconstrained model and then at each step testing configural, weak/metric, strong/scalar, and strict invariance. Configural invariance is demonstrated when the number of factors and the items that load on each factor are the same for each subgroup (e.g., for measurement invariance across sex, the 2 factors are loaded with the same variables for men as for women). If configural invariance is supported, weak invariance is tested, in which the factor loadings for each variable are constrained to be equal across subgroups. Strong invariance additionally constrains the intercepts. Finally, strict invariance requires that not only the factor loadings and intercepts but also the residuals are equal across subgroups (Meredith & Teresi, 2006).
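For illustration, the invariance sequence can be specified in lavaan roughly as follows (a sketch using simulated standardized residuals; the indicator names are illustrative, and, unlike our analysis, lavaan's default marker-variable identification is used rather than the reference-group method):

```r
library(lavaan)

## Simulated standardized residuals standing in for one normative model's
## output: two correlated indicator sets (memory, executive function).
set.seed(1)
n <- 1000
mem <- rnorm(n); exe <- rnorm(n)
resid_df <- data.frame(
  rey1_z = 0.8 * mem + rnorm(n, sd = 0.6),
  rey2_z = 0.8 * mem + rnorm(n, sd = 0.6),
  mat_z  = 0.7 * exe + rnorm(n, sd = 0.7),
  aft1_z = 0.7 * exe + rnorm(n, sd = 0.7),
  aft2_z = 0.7 * exe + rnorm(n, sd = 0.7),
  sex    = sample(c("F", "M"), n, replace = TRUE)
)

model <- '
  memory    =~ rey1_z + rey2_z
  executive =~ mat_z + aft1_z + aft2_z
'

## Nested MG-CFA models with increasing equality constraints across groups.
configural <- cfa(model, data = resid_df, group = "sex")
metric     <- cfa(model, data = resid_df, group = "sex",
                  group.equal = "loadings")
scalar     <- cfa(model, data = resid_df, group = "sex",
                  group.equal = c("loadings", "intercepts"))
strict     <- cfa(model, data = resid_df, group = "sex",
                  group.equal = c("loadings", "intercepts", "residuals"))
```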

The level of measurement invariance supported by each of the estimated regression models in this study was assessed by comparing the goodness-of-fit indices of the successive nested MG-CFA models. There are many model fit indices to choose from, and we chose three. Little (2013) reviewed the evidence for various fit indices and concluded that change in the comparative fit index (CFI) had the strongest evidence base (including from simulation studies) for model comparisons. From this body of work, a ΔCFI of < 0.01 between two models appears to indicate that the models fit the data equally well; we used this as the criterion for comparing the MG-CFA models to determine the strongest level of measurement invariance supported by each regression model. For our second fit index, we followed the recommendations of Putnick and Bornstein (2016), who suggested that a Root Mean Square Error of Approximation (RMSEA) Δ < 0.015 had the strongest evidence. For the third, we report the commonly used Δχ2, although it has been criticized by Little (2013) as too sensitive to trivial differences in large sample sizes.
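Continuing the lavaan sketch, the nested models can be compared on the three indices described above:

```r
## Fit indices per invariance step; support for the stricter level requires
## |ΔCFI| < .01 and ΔRMSEA < .015 between adjacent models.
fits <- list(configural = configural, metric = metric,
             scalar = scalar, strict = strict)
idx  <- sapply(fits, fitMeasures, fit.measures = c("cfi", "rmsea"))

delta_cfi   <- diff(idx["cfi", ])
delta_rmsea <- diff(idx["rmsea", ])
lavTestLRT(configural, metric, scalar, strict)  # chi-square difference tests
```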

Model comparisons using precision of the 5th percentile

Precision of the 5th percentile obtained from each of the regression-based normative models was estimated using bootstrapping programmed in R (R Core Team, 2017). Precision was conceptualized as the 95% inter-percentile range (IPR) of scores corresponding to the 5th percentile. One thousand bootstrapped samples (i.e., random samples with replacement) were generated from the French- and, separately, the English-speaking normative samples. For each bootstrapped sample, the test score corresponding to the 5th percentile was determined and used to construct a sampling distribution of the estimate of the 5th percentile. The IPR was then obtained from this generated sampling distribution (of the 1,000 estimates of the 5th percentile "statistic") as the difference between the standardized residuals corresponding to its 97.50th and 2.50th percentiles (i.e., the 975th and 25th ranked estimates). The IPR thus gives a 95% confidence level for the estimate of the 5th percentile, with a smaller IPR indicating higher precision (Oosterhuis et al., 2016, p. 196).
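A minimal sketch of this bootstrap, applied to the standardized residuals (z_resid from the earlier sketch) of any one normative model:

```r
## 1,000 bootstrap estimates of the 5th percentile of the residuals.
set.seed(2021)
boot_p5 <- replicate(1000,
                     quantile(sample(z_resid, replace = TRUE), probs = 0.05))

## 95% inter-percentile range (IPR): distance between the 975th and 25th
## ranked estimates; a smaller IPR indicates higher precision.
ipr       <- unname(quantile(boot_p5, 0.975) - quantile(boot_p5, 0.025))
point_est <- mean(boot_p5)  # point estimate of the 5th percentile cut-off
```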

Creating normative standard data for each cognitive test: the final step

For each cognitive test, the "best" model was chosen for creating the normative data. Ideally, the best model would have the highest level of measurement invariance and the most precise (smallest IPR) estimate of the 5th percentile. We obtained the standardized residuals from this model and transformed them into the normative data (i.e., normed test scores in the original test score units) using the weighted mean and standard deviation of the original test scores from the normative sample. The weights were based on the sampling procedures from the Canadian population (https://www.clsa-elcv.ca/doc/1041), which we previously explored for use in the creation of normative data (O'Connell et al., 2019). Sampling weights allow extrapolation of observations from the sample to the population from which the sample was obtained by inflating the observations in the sample to their level in the population, thus minimizing the impact of sampling bias. To transform scores into standard scores, the overall weighted mean and standard deviation for each test score in the French- and English-speaking samples were used, in order to avoid a double correction with stratification, which could occur if stratified weighted means and standard deviations were used.
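A minimal sketch of this final transformation, with wts standing in as a placeholder for the CLSA sampling weights:

```r
## Overall weighted mean and SD of the raw test scores, then the linear
## transformation of the standardized residuals back to test-score units.
wts    <- runif(nrow(norm_df), 0.5, 2.0)  # placeholder sampling weights
w_mean <- weighted.mean(norm_df$rey1, w = wts)
w_var  <- sum(wts * (norm_df$rey1 - w_mean)^2) / sum(wts)

normed_scores <- z_resid * sqrt(w_var) + w_mean
```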

Results

As can be seen in Table 1, the French- and English-speaking samples were similar in demographic variables (i.e., the effect sizes were small to trivial). It should also be noted that the CLSA normative sample is highly educated. Table 2 includes overall summary scores on the cognitive measures described in the current study for each of these normative samples. The mean differences in raw test scores between English and French were trivial, as indicated by the small effect sizes in the cases where the two groups differed statistically (due to the large sample sizes). This finding hints that the way cognition is measured by these cognitive tests might be similar in the two languages, but evidence of measurement invariance across language of administration is needed. Table 3 includes information about the distribution of test scores; although the scores for the two memory measures approached positive skew and leptokurtosis, they were below the proposed thresholds of |1.00|, and the small standard errors suggest statistically significant deviation from normality (skewness and/or kurtosis) for all test scores.

Table 2. Overall performance on the cognitive tests.

Table 3. Skewness and kurtosis of the raw score distributions of the cognitive tests in the Tracking English- and French-speaking samples.

The bivariate associations between demographic variables and the cognitive test scores are presented in Table 4. What is apparent from this table is the different pattern of associations of the demographic variables with each cognitive test within the two language samples. This finding provides further support for creating normative data for the English- and French-speaking samples separately.

Table 4. Magnitude of bivariate associations for test scores and demographic variables in the normative Tracking sample by language of administration.

The exploration of how well the different models account for measurement bias is presented in Table 5. It is clear from this table that not all normative models remove measurement bias. Although this finding is not surprising for the more parsimonious models that did not correct for all variables (i.e., the age-only model), it was unexpected for the full regression model that accounts for all demographic variables. A hybrid regression and stratification model consistently performed best at reducing bias based on sex and education level.

Table 5. Level of measurement invariance (MI) for different regression models tested.

The precision of the 5th percentile of the distribution of standardized residuals from the different regression-based normative models for each cognitive test is shown in Tables 6 and 7 for the English- and French-speaking samples, respectively. The tables also show, in standardized residual units, the width of the 95% IPR, which is the estimate's precision that we propose to use for selecting the best model for deriving the comparative normative data. We also present the lower and upper bounds of the 95% IPR (i.e., the 95% confidence interval for the 5th percentile), and the mean of the sampling distribution corresponding to the point estimate of the 5th percentile cut-off (again, in standardized residual scores; see Note 1). Two observations can be gleaned from these IPR results. First, the differences in precision of the 5th percentile estimate across the different regression models were very small. In the English-speaking sample, the difference in the width of the 95% IPR between the best and worst models (in terms of precision) ranged from 0.00 for AFT 1 to 0.07 for REY II. For the French-speaking sample, the differences were overall slightly larger, ranging from 0.05 for AFT 1 to 0.12 for REY II.

Table 6. English-speaking sample: 95% inter-percentile range (IPR) for the 5th percentile estimate (for standardized residuals).

Table 7. French-speaking sample: 95% inter-percentile range (IPR) for the 5th percentile estimate (for standardized residuals).

Second, we note that the regression-based approaches that addressed the linearity assumption (the first column of Table 6 for the English-speaking sample) produced the most precise estimates for two of the three cognitive measures, REY I and MAT. We had postulated that stratifying the English-speaking sample into 10-year age groups, in order to account for the non-linearity, might lead to reduced precision. Our results indicate, however, that precision was not substantially reduced despite the considerably smaller sample sizes within these finely stratified samples (the smallest N among the 32 groups stratified by sex (2 levels) × education level (4 levels) × age group (4 levels) was 49).

Discussion

Using real-world data from the CLSA and models that adjust for demographic covariates in different ways, our findings indicate that the choice of how this adjustment is done affects the validity and reliability of the resulting normative data. Although the purpose of covariate corrections within normative data is to mitigate measurement bias due to covariates such as sex and education, the present results demonstrate that not all methods of correcting for covariates in normative models reduce this measurement bias. This finding has clinical implications because few authors or test developers have compared normative data; instead, normative data are frequently provided without any critical appraisal. Full regression models with all three covariates as predictors did not reduce measurement bias. Overall, the best performing model across both languages and all cognitive tests was a hybrid approach with stratification for the categorical variables (sex and, in the CLSA, educational attainment) and age as a continuous predictor. In the English sample, strict measurement invariance (MI) was met for sex and education stratification with age as a continuous predictor; in the French sample, this was the only model that converged for MI of education groups, albeit only at the level of strong MI. Nonetheless, strict and strong MI are considered adequate evidence for lack of measurement bias (Meredith & Teresi, 2006; Wu et al., 2007).

The MI results were unexpected, because we anticipated that the full regression normative model would mitigate measurement bias. It could be argued that the regression model was insufficiently specified with only the main effects of sex and education level and that, for example, an interaction effect should have been explored. However, it was also surprising that the model created to mitigate failures of the assumptions underlying a regression analysis in the English sample, specifically non-linearity (and to a lesser extent heteroscedasticity and non-normality of the outcome variable), also failed to reduce measurement bias. This finding expands on the results reported by Oosterhuis et al. (2016), who cautioned that their simulation study did not reveal how well regression-based normative models performed when underlying assumptions were violated. As noted, the CLSA data violate several assumptions for regression analyses: some of the cognitive test scores have moderate positive skew; education is highly negatively skewed; and age is slightly non-linearly related to the scores for three of the cognitive tests in the English-speaking sample. We would have predicted that the normative regression model that met all assumptions should have had the strongest evidence for validity, perhaps at the cost of reliability (precision) due to smaller cell sizes. However, the superiority of the hybrid sex- and education-stratified age-regression model was unexpected.

We explored evidence for reliability using the approach of Oosterhuis et al. (2016), comparing the precision of the "cut-off" point, which we defined as the test score corresponding to the 5th percentile. The differences in precision among the different models in the English sample were negligible, and the precision of the stratified model with piecewise age regressions was similar to that of the full regression model across all tests, underscoring the importance of sample size in estimates of precision. Once converted to scores on the original cognitive tests (using the weighted M and SD for the original scales), these small differences in the standardized residuals became a difference of 1.10 points for the MAT, 0.75 for AFT 2, and 0.62 for REY II in the French-speaking sample. A single point of uncertainty when making a clinical decision about the cut-off for impairment at the 5th percentile can have implications for a clinical assessment of borderline versus impaired performance. We postulate that differences in precision would be more pronounced had the normative sample been even smaller than N = 1,760. It could be argued that these data are somehow unique to the French-speaking sample, but in our prior work (O'Connell et al., 2019), randomly drawing progressively smaller samples from the English-speaking group, we demonstrated progressively less precision, indicating that precision clearly depends on sample size. Precision, however, interacts with the choice of cut-off for impairment. Oosterhuis et al. (2016) found that differences in precision were more pronounced for extreme cut-offs for impairment, for example at the 2nd percentile, and these differences should be less pronounced with a cut-off of 1.5 SD below the mean (approximately the 7th percentile).

We postulated that focusing only on predictors with stronger associations with the criterion, rather than including all demographic variables, even those weakly or trivially associated, might lead to more precise regression-based normative models (e.g., Pedhazur, 1997). Based on the precision of the 5th percentile estimates of the various models we tested, we conclude that the models with only the "stronger" predictors, defined here as those with a correlation of .20 or greater with the cognitive test scores, did not produce the most precise estimates, but neither did they produce the worst; typically, these models fell in the middle of the pack. At the same time, including all three predictors, whether or not they met the "strong predictor" inclusion criterion, also did not produce the most or least precise estimates. The challenge here is that there are no empirical guidelines regarding what an important association might be and when we should correct for covariates. Although Cohen (1988) provided labels for the magnitude of effect sizes, such as r = .30 as a medium effect, he stated that researchers had to gather data to determine the importance of any particular magnitude of effect in their research area. There is no consensus regarding the magnitude of bivariate association necessary to prompt correction as a covariate within normative models. Our results suggest that corrections for demographic-cognitive associations that are small in magnitude do not necessarily reduce precision. Future research, ideally with a simulation study, might illuminate the magnitude of association important for correction of covariates within a normative model.

The major contribution of this paper is methodological. We have demonstrated that the method of controlling for variables known to impact cognitive test scores within the regression model framework has implications for validity, but not for reliability if the sample is sufficiently large. In the smaller French sample of 1,760, however, the differences in precision were more noticeable, and the models varied by up to 1 test-score point in precision. Precision communicates how well the relative rankings derived from a normative sample represent the relative rankings in the normative population, and our results indicate that more caution is warranted in clinical interpretation of impairment for the French MAT.

This work has implications for future test developers and for the creation of normative data in general. We argue that test developers may wish to create a variety of regression-based models from their normative sample using different methods of covariate correction, and compare the evidence for reliability and validity of the different normative models. The final normative model provided for clinical use can then be the one with the strongest evidence for validity and precision, and clinicians can be confident that interpretations based on the relative ranking of their client's score against the normative population are free of bias and minimal in error when the normative data are based on large samples. Finally, we have appended the SPSS syntax that calculates the normative data for this brief battery of tests (see Note 2).

Acknowledgements

This research was made possible using the data collected by the Canadian Longitudinal Study on Aging (CLSA). Funding for the CLSA is provided by the Government of Canada through the Canadian Institutes of Health Research (CIHR) under grant reference LSA 94473 and by the Canada Foundation for Innovation. This research has been conducted using the CLSA dataset [Baseline Tracking Data set version 3.0], under Application Number [180001S]. The opinions expressed in this manuscript are the authors' own and do not reflect the views of the Canadian Longitudinal Study on Aging. The preparation of this manuscript was partially supported by funding provided by the Alzheimer Society of Canada/Alzheimer Société de Canada and the Pacific Alzheimer Research Foundation. Support was provided for the second author (LG) by the McLaughlin Foundation Professorship in Population and Public Health. PR holds the Tier 1 Canada Research Chair in Geroscience and the Labarge Chair in Optimal Aging. We would like to acknowledge Dr. Todd Little for his StatsCamp training on MG-CFA. We would also like to acknowledge one of MEO's PhD students, Jake Ursenbach, for his help creating part of the R code, and Dr. Audrey J. Leroux, who checked this R code during R consulting at StatsCamp.

Disclosure statement

No potential conflict of interest was reported by the authors.

Notes

1 The sampling distribution of the 1,000 bootstrapped estimates of the 5th percentile should be normal, due to the Central Limit Theorem; therefore, the midpoint of the interval should represent the point estimate of the mean of this sampling distribution, in other words, the point estimate of the 5th percentile score.

2 Since the time of the model exploration reported in this manuscript, the CLSA updated their database. The normative data are based on the updated database, in which some participants' scores changed and the sampling weights changed. The models reported in the current paper do not use the sampling weights.

References

  • Berrigan, L. I., Fisk, J. D., Walker, L. A., Wojtowicz, M., Rees, L. M., Freedman, M. S., & Marrie, R. A. (2014). Reliability of regression-based normative data for the oral symbol digit modalities test: An evaluation of demographic influences, construct validity, and impairment classification rates in multiple sclerosis samples. The Clinical Neuropsychologist, 28(2), 281–299. https://doi.org/10.1080/13854046.2013.871337
  • Bertola, L., Benseñor, I. M., Barreto, S. M., Moreno, A. B., Griep, R. H., Viana, M. C., Lotufo, P. A., & Suemoto, C. K. (2020). Measurement Invariance of Neuropsychological Tests Across Different Sociodemographic Backgrounds in the Brazilian Longitudinal Study of Adult Health (ELSA-Brasil). Neuropsychology, 34(2), 227–234. https://doi.org/10.1037/neu0000597
  • Brewster, P. W. H., Tuokko, H., & MacDonald, S. W. S. (2014). Measurement equivalence of neuropsychological tests across education levels in older adults. Journal of Clinical and Experimental Neuropsychology, 36(10), 1042–1054. https://doi.org/10.1080/13803395.2014.967661
  • Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
  • Crawford, J. R., Cayley, C., Lovibond, P. F., Wilson, P. H., & Hartley, C. (2011). Percentile Norms and Accompanying Interval Estimates from an Australian General Adult Population Sample for Self-Report Mood Scales (BAI, BDI, CRSD, CES-D, DASS, DASS-21, STAI-X, STAI-Y, SRDS, and SRAS). Australian Psychologist, 46(1), 3–14. https://doi.org/10.1111/j.1742-9544.2010.00003.x
  • Crawford, J. R., Garthwaite, P. H., & Gault, C. B. (2007). Estimating the percentage of the population with abnormally low scores (or abnormally large score differences) on standardized neuropsychological test batteries: A generic method with applications. Neuropsychology, 21(4), 419–430. https://doi.org/10.1037/0894-4105.21.4.419
  • Crawford, J. R., Garthwaite, P. H., & Slick, D. J. (2009). On percentile norms in neuropsychology: Proposed reporting standards and methods for quantifying the uncertainty over the percentile ranks of test scores. The Clinical Neuropsychologist, 23(7), 1173–1195. https://doi.org/10.1080/13854040902795018
  • Crawford, J. R., & Howell, D. C. (1998). Comparing an individual’s test score against norms derived from small samples. The Clinical Neuropsychologist, 12(4), 482–486. https://doi.org/10.1076/clin.12.4.482.7241
  • Delis, D. C., Kaplan, E., & Kramer, J. H. (2001). Delis-Kaplan Executive Function System® (D-KEFS®): Examiner’s manual: Flexibility of thinking, concept formation, problem solving, planning, creativity, impulse control, inhibition. Pearson.
  • Dowling, N. M., Hermann, B., La Rue, A., & Sager, M. A. (2010). Latent structure and factorial invariance of a neuropsychological test battery for the study of preclinical Alzheimer’s disease. Neuropsychology, 24(6), 742–756. https://doi.org/10.1037/a0020176
  • Goodglass, H., & Kaplan, E. (1983). The assessment of aphasia and related disorders (2nd ed.). Lea & Febinger.
  • Heaton, R. K., Avitable, N., Grant, I., & Matthews, C. G. (1999). Further crossvalidation of regression-based neuropsychological norms with an update for the Boston Naming Test. Journal of Clinical and Experimental Neuropsychology, 21(4), 572–582. https://doi.org/10.1076/jcen.21.4.572.882
  • Little, T. D. (2013). Longitudinal structural equation modeling (Methodology in the social sciences). Guilford.
  • Maitland, S. B., Intrieri, R. C., Schaie, K. W., & Willis, S. L. (2000). Gender differences and changes in cognitive abilities across the adult life span. Aging, Neuropsychology and Cognition, 7(1), 32–53. https://doi.org/10.1076/anec.7.1.32.807
  • Meredith, W., & Teresi, J. A. (2006). An essay on measurement and factorial invariance. Medical Care, 44(11 Suppl 3), S69–S77. https://doi.org/10.1097/01.mlr.0000245438.73837.89
  • O’Connell, M. E., Tuokko, H., Kadlec, H., Griffith, L. E., Simard, M., Taler, V., Voll, S., Thompson, M. E., Panyavin, I., & Wolfson, C. (2019). Normative comparison standards for measures of cognition in the Canadian Longitudinal Study on Aging (CLSA): Does applying sample weights make a difference? Psychological Assessment, 31(9), 1081–1091. https://doi.org/10.1037/pas0000730
  • O’Connell, M. E., Tuokko, H., Voll, S., Simard, M., Griffith, L. E., Taler, V., Wolfson, C., Kirkland, S., & Raina, P. (2017). An evidence-based approach to the creation of normative data: Base rates of impaired scores within a brief neuropsychological battery argue for age corrections, but against corrections for medical conditions. The Clinical Neuropsychologist, 31(6–7), 1188–1203. https://doi.org/10.1080/13854046.2017.1349931
  • Oosterhuis, H. E., van der Ark, L. A., & Sijtsma, K. (2016). Sample size requirements for traditional and regression-based norms. Assessment, 23(2), 191–202. https://doi.org/10.1177/1073191115580638
  • Oosterhuis, H. E., van der Ark, L. A., & Sijtsma, K. (2017). Standard errors and confidence intervals of norm statistics for educational and psychological tests. Psychometrika, 82(3), 559–588. https://doi.org/10.1007/s11336-016-9535-8
  • Pedhazur, E. J. (1997). Multiple regression in behavioral research (3rd Ed.). Wadsworth.
  • Pedraza, O., & Mungas, D. (2008). Measurement in cross-cultural neuropsychology. Neuropsychology Review, 18(3), 184–193. https://doi.org/10.1007/s11065-008-9067-9
  • Putnick, D. L., & Bornstein, M. H. (2016). Measurement invariance conventions and reporting: The state of the art and future directions for psychological research. Developmental Review : DR, 41, 71–90. https://doi.org/10.1016/j.dr.2016.06.004
  • R Core Team. (2017). R: A language and environment for statistical computing. https://www.R-project.org
  • Raina, P., Wolfson, C., Kirkland, S., Griffith, L. E., Balion, C., Cossette, B., Dionne, I., Hofer, S., Hogan, D., van den Heuvel, E. R., Liu-Ambrose, T., Menec, V., Mugford, G., Patterson, C., Payette, H., Richards, B., Shannon, H., Sheets, D., Taler, V., … Young, L. (2019). Cohort profile: The Canadian Longitudinal Study on Aging (CLSA). International Journal of Epidemiology, 48(6), 1752–1753. https://doi.org/10.1093/ije/dyz173
  • Raina, P. S., Wolfson, C., Kirkland, S. A., Griffith, L. E., Oremus, M., Patterson, C., Tuokko, H., Penning, M., Balion, C. M., Hogan, D., Wister, A., Payette, H., Shannon, H., & Brazil, K. (2009). The Canadian longitudinal study on aging (CLSA). Canadian Journal on Aging = La Revue Canadienne du Vieillissement, 28(3), 221–229. https://doi.org/10.1017/S0714980809990055
  • Rawlings, A. M., Bandeen-Roche, K., Gross, A. L., Gottesman, R. F., Coker, L. H., Penman, A. D., Sharrett, A. R., & Mosley, T. H. (2016). Factor structure of the ARIC-NCS Neuropsychological Battery: An evaluation of invariance across vascular factors and demographic characteristics. Psychological Assessment, 28(12), 1674–1683. https://doi.org/10.1037/pas0000293
  • Rey, A. (1964). L’examen clinique en psychologie [Clinical tests in psychology]. Presses Universitaires de France.
  • Roelofs, J., van Breukelen, G., de Graaf, L. E., Beck, A. T., Arntz, A., & Huibers, M. J. (2013). Norms for the Beck Depression Inventory (BDI-II) in a large Dutch community sample. Journal of Psychopathology and Behavioral Assessment, 35(1), 93–98. https://doi.org/10.1007/s10862-012-9309-2
  • Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. https://doi.org/10.18637/jss.v048.i02
  • Teng, E. L. (1994). The Mental Alternation Test (MAT). Department of Neurology, University of Southern California School of Medicine.
  • Tuokko, H., Griffith, L. E., Simard, M., & Taler, V. (2017). Cognitive measures in the Canadian Longitudinal Study on Aging. The Clinical Neuropsychologist, 31(1), 233–250. https://doi.org/10.1080/13854046.2016.1254279
  • Van Breukelen, G. J., & Vlaeyen, J. W. (2005). Norming clinical questionnaires with multiple regression: The Pain Cognition List. Psychological Assessment, 17(3), 336–344. https://doi.org/10.1037/1040-3590.17.3.336
  • Van der Elst, W., Hoogenhout, E. M., Dixon, R. A., De Groot, R. H., & Jolles, J. (2011). The Dutch Memory Compensation Questionnaire: Psychometric properties and regression-based norms. Assessment, 18(4), 517–529. https://doi.org/10.1177/1073191110370116
  • Van der Elst, W., Ouwehand, C., van der Werf, G., Kuyper, H., Lee, N., & Jolles, J. (2012). The Amsterdam Executive Function Inventory (AEFI): Psychometric properties and demographically corrected normative data for adolescents aged between 15 and 18 years. Journal of Clinical and Experimental Neuropsychology, 34(2), 160–171. https://doi.org/10.1080/13803395.2011.625353
  • Van der Elst, W., Van Boxtel, M. P., Van Breukelen, G. J., & Jolles, J. (2006). Normative data for the Animal, Profession and Letter M Naming verbal fluency tests for Dutch speaking participants and the effects of age, education, and sex. Journal of the International Neuropsychological Society, 12(1), 80–89. https://doi.org/10.1017/S1355617706060115
  • Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4–70. https://doi.org/10.1177/109442810031002
  • Voncken, L., Albers, C. J., & Timmerman, M. E. (2019). Improving confidence intervals for normed test scores: Include uncertainty due to sampling variability. Behavior Research Methods, 51(2), 826–839. https://doi.org/10.3758/s13428-018-1122-8
  • Wu, A. D., Li, Z., & Zumbo, B. D. (2007). Decoding the meaning of factorial invariance and updating the practice of multi-group confirmatory factor analysis: A demonstration with TIMSS data. Practical Assessment, Research and Evaluation, 12(3), 1–26.