
Normative comparisons for large neuropsychological test batteries: User-friendly and sensitive solutions to minimize familywise false positives

Pages 611-629 | Received 27 Mar 2015, Accepted 11 Dec 2015, Published online: 10 Apr 2016

ABSTRACT

Introduction. In neuropsychological research and clinical practice, a large battery of tests is often administered to determine whether an individual deviates from the norm. We formulate three criteria for such large battery normative comparisons. First, the familywise false-positive error rate (i.e., the complement of specificity) should be controlled at, or below, a prespecified level. Second, sensitivity to detect genuine deviations from the norm should be high. Third, the comparisons should be easy enough for routine application, not only in research, but also in clinical practice. Here we show that these criteria are satisfied for current procedures used to assess an overall deviation from the norm—that is, a deviation given all test results. However, we also show that these criteria are not satisfied for current procedures used to assess test-specific deviations, which are required, for example, to investigate dissociations in a test profile. We therefore propose several new procedures to assess such test-specific deviations. These new procedures are expected to satisfy all three criteria. Method. In Monte Carlo simulations and in an applied example pertaining to Parkinson disease, we compare current procedures to assess test-specific deviations (uncorrected and Bonferroni normative comparisons) to new procedures (Holm, one-step resampling, and step-down resampling normative comparisons). Results. The new procedures are shown to: (a) control the familywise false-positive error rate, whereas uncorrected comparisons do not; (b) have higher sensitivity than Bonferroni corrected comparisons, with step-down resampling being especially favorable in this respect; (c) be user-friendly, as they are implemented in a normative comparisons website and the required normative data are provided by a database. Conclusion. These new normative comparison procedures, especially step-down resampling, are valuable additional tools to assess test-specific deviations from the norm in large test batteries.

In neuropsychological assessment, an individual is often administered a large battery of tests (e.g., Arenas-Pinto et al., 2014; Binder, Iverson, & Brooks, 2009; Brooks, 2010; Crawford, Garthwaite, & Gault, 2007; Larrabee, 2014; Schretlen, Testa, Winicki, Pearlson, & Gordon, 2008; S. J. Wilson et al., 2015). The score on each test is then compared to its normative data, to assess deviations from the norm. This paper addresses the question of how to perform such large battery normative comparisons in a valid and easy way, thereby facilitating routine application in neuropsychological research and in neuropsychological practice.

Large battery comparisons are ubiquitous. In clinical practice, they are used to inform diagnosis and/or guide tailored treatment (Lezak, Howieson, Bigler, & Tranel, 2012). In research, they serve two purposes. First, they may be used to classify participants into impaired versus unimpaired groups. These groups are then studied to investigate prevalence, demographic factors, biomarkers, or treatment effects (e.g., Meyer, Boscardin, Kwasa, & Price, 2013). Second, the classification into impaired versus unimpaired serves in some treatment effect studies as a dependent variable. That is, treatment effects are assessed not only in a continuous fashion—that is, whether a mean memory score improves under a new treatment as compared to treatment as usual—but also in a discrete manner—that is, whether the percentage of participants with a memory impairment is reduced under a new treatment as compared to treatment as usual (cf. Kazdin, 2008; Kraemer & Kupfer, 2006).

Adequate procedures for normative comparisons of a single test have already been proposed (Crawford & Howell, 1998). These procedures have been extended in various ways—for example, to yield effect sizes and confidence intervals (Crawford & Garthwaite, 2002) and to account for background variables like an individual’s age or level of education (Crawford & Garthwaite, 2006). In such single test normative comparisons, a test score falling below a percentile criterion of the normative data is considered to be abnormal. For example, the percentile criterion may be set at 5%. This 5% criterion implies that the false-positive error rate—that is, the chance of deciding that an individual deviates from the norm whereas she or he actually does not—is 5%.¹

In the case of large test batteries, the 5th percentile criterion implies that the false-positive error rate is 5% for each test separately, corresponding to a specificity of 95%. These false-positive errors accumulate when multiple tests are administered, yielding an overall false-positive error rate, the familywise false-positive error rate, which will exceed 5%. More specifically, if M tests are administered, the familywise false-positive error rate, from now on the familywise error, is [1 − (1 − 0.05)^M] × 100%, provided that tests are uncorrelated in the normative sample. For example, the familywise error for M = 13 uncorrelated tests is then 49%. That is, a healthy individual has a 50–50 chance of being classified as deviating on at least one test (cf. Huizenga, Smeding, Grasman, & Schmand, 2007). Although this familywise error will be lower if tests are correlated in the normative sample, it will often substantially exceed 5% (Crawford et al., 2007; Huizenga et al., 2007).
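For concreteness, this familywise error calculation is easily expressed in R (the language also used for the procedures in the Appendix); the function below is a minimal sketch:

fwe <- function(M, alpha = .05) (1 - (1 - alpha)^M) * 100  # familywise error in %
fwe(13)  # 48.7: roughly a 50-50 chance of at least one false positive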

There is an increasing awareness in the neuropsychological community that it is necessary to control familywise error at prespecified levels. This awareness is present in the group means testing context, where it is, for example, tested whether group means differ (two-sample t tests) or whether group means differ from a hypothesized value (one-sample t tests) on multiple neuropsychological tests. In such a group means testing context, it has been argued that a lack of control over familywise error may give rise to overinterpretation of chance findings (e.g., Bell, Olivier, & King, 2013; Blakesley et al., 2009; Eichstaedt, Kovatch, & Maroof, 2013; Levav et al., 2002; Lewis, Maruff, Silbert, Evered, & Scott, 2006; Schatz, Jay, McComb, & McLaughlin, 2005; C. E. Wilson et al., 2014; cf. Ioannidis, 2005; Miguel et al., 2014; Simmons, Nelson, & Simonsohn, 2011). In an excellent review specifically aimed at the neuropsychological community, Blakesley et al. (2009) evaluated several procedures to control familywise error in the group means testing context. They studied, for example, the well-known Bonferroni procedure, the Holm procedure (Holm, 1979), and various resampling procedures (Westfall & Young, 1993). Simulation studies indicated that these alternative procedures all controlled familywise error.

The familywise error issue is also prominent in the normative comparisons context (e.g., Berthelson, Mulchan, Odland, Miller, & Mittenberg, 2013; Bilder, Sugar, & Hellemann, 2014; Brooks, 2010; Crawford et al., 2007; Davis & Millis, 2014; Larrabee, 2008, 2014; Loewenstein et al., 2006; Meyers et al., 2014; Naglieri & Paolitto, 2005; Palmer, Boone, Lesser, & Wohl, 1998; Proto et al., 2014; Schretlen et al., 2008). It has been argued that in clinical practice, lack of control over familywise error in normative comparisons may result in overdiagnosis and unnecessary treatment, increasing patient burden and unnecessary costs to the health care system (Binder et al., 2009; Brooks, Iverson, Holdnack, & Feldman, 2008; Gisslén, Price, & Nilsson, 2011; Torti, Focà, Cesana, & Lescure, 2011). In neuropsychological research, it has been argued that lack of control has two disadvantages. First, if normative comparisons are used to assign participants to impaired and unimpaired groups, lack of control will lead to the inclusion of false positives in the impaired sample, resulting in heterogeneity, and thus in less powerful studies of, for example, prevalence, risk factors, biomarkers, and treatment effects (Blackford & La Rue, 1989; Brooks, Iverson, Feldman, & Holdnack, 2009; Höfler, 2005; Meyer et al., 2013). Second, if normative comparisons are used to assess deviations from the norm after treatment, lack of control may lead to the conclusion that many participants still deviate from the norm, whereas the treatment was actually quite effective. We therefore require that procedures for normative comparisons control familywise error at prespecified levels.

We also require that procedures have adequate sensitivity to detect genuine deviations from the norm. Detection of genuine deficits is important in neuropsychological research. First, it offers the opportunity to precisely investigate prevalence and progression of these deficits. Second, it allows identification of all deficits associated with a disorder, thereby offering the opportunity to gain more insight into the mechanisms underlying the disorder (Lezak et al., 2012). Detection of genuine deficits is also important in neuropsychological clinical practice, as it offers the opportunity to target interventions to these deficits (e.g., Constantinidou, Wertheimer, Tsanadis, Evans, & Paul, 2012; Sander, Nakase-Richardson, Constantinidou, Wertheimer, & Paul, 2007).

In addition to these familywise error and sensitivity criteria, we also require that procedures are easy to apply, as they should offer the possibility of routine application in neuropsychological assessment, not only in research but also in clinical practice. Procedures that require a statistical background, programming skills, and/or large normative datasets will not be used very often. Therefore, we require that procedures be user-friendly.

Before reviewing potential procedures that may satisfy the three criteria, it is informative to make a distinction between two main aims of large battery comparisons (cf. Huberty & Morris, 1989). First, large battery comparisons are used to classify individuals as overall impaired or unimpaired given all tests. Second, large battery comparisons are also used to provide test-specific classifications as impaired or unimpaired—for example, to investigate dissociations in the test profile. For example, in test-specific classifications an individual may be classified as impaired on a memory test but as unimpaired on the other neuropsychological tests. In the following we review whether current procedures for overall and test-specific classification satisfy the familywise error, sensitivity, and user-friendliness criteria.

Overall classification as impaired or unimpaired

One procedure for overall classification is to require deviations on multiple tests (e.g., Arenas-Pinto et al., 2014; Axelrod & Wall, 2007; Grunseit, Perdices, Dunbar, & Cooper, 1994; Ingraham & Aiken, 1996; Proto et al., 2014). Several approaches have been adopted to determine the number of required deviations. Among them, approaches taking the dependency between test scores into account are to be preferred (Berthelson et al., 2013; Crawford et al., 2007; Muslimovic, Post, Speelman, & Schmand, 2005; Schagen, Muller, Boogerd, Mellenbergh, & van Dam, 2006; Schretlen et al., 2008), as they satisfy the three criteria (e.g., Crawford et al., 2007). That is, they control familywise error at prespecified levels, have adequate sensitivity, and are relatively easy to apply, as software exists to determine the number of required deviations (Crawford, 2016).

A second procedure for overall classification is to perform a multivariate normative comparison (e.g., Cohen et al., 2015; González-Redondo et al., 2012; Smeding, Speelman, Huizenga, Schuurman, & Schmand, 2011; Su et al., 2015). In a multivariate comparison, it is determined whether an entire test profile—that is, an individual’s combination of test scores—differs from that in the normative sample (Crawford & Allan, 1994; Grasman, Huizenga, & Geurts, 2010; Huba, 1985; Huizenga et al., 2007). This method satisfies the three criteria (Huizenga et al., 2007). That is, familywise error is controlled, sensitivity is adequate, and it is easy to apply, as the procedure is implemented in a webpage (Multivariate normative comparisons, 2016).

In sum, the overall classification procedures satisfy the three criteria. However, this is not the case for current test-specific classification procedures, as we outline next.

Test-specific classification as impaired or unimpaired: Current procedures

The first common procedure for test-specific classifications is to perform uncorrected comparisons—that is, to treat each test as if it were the only test administered. As indicated earlier, these uncorrected comparisons do not control familywise error. As a result, sensitivity is very high. The procedure is very user-friendly, as no additional computations are required. So uncorrected comparisons do not satisfy the familywise error criterion, yet they do satisfy the sensitivity and user-friendliness criteria.

The second procedure is a Bonferroni normative comparison (e.g., Huizenga et al., 2007). If tests are uncorrelated in the normative sample, this correction yields a familywise error never exceeding 5%. However, if test scores are correlated, which is much more common, Bonferroni correction results in a familywise error that is too low and, consequently, in a decreased sensitivity to detect genuine deviations from the norm (e.g., Huizenga et al., 2007). Therefore, Bonferroni normative comparisons satisfy the familywise error criterion, but not the sensitivity criterion. The user-friendliness criterion is satisfied, as the procedure is relatively simple to apply.

Test-specific classification as impaired or unimpaired: New procedures

As uncorrected and Bonferroni normative comparisons do not satisfy all criteria, we propose three alternatives: Holm, one-step, and step-down resampling normative comparisons. Below we only indicate whether these procedures are likely to satisfy the three criteria; the procedures are described in more detail in the Method section.

The first new procedure is based on the Holm method (Holm, 1979). In the usual group means testing context, it has been shown that Holm controls familywise error. It has also been shown that the Holm method is characterized by higher sensitivity than Bonferroni, although sensitivity is still too low if test scores are correlated (Blakesley et al., 2009; Eichstaedt et al., 2013; Holm, 1979). Up to now the Holm method has only been applied in the group means testing context, but we will show that it can easily be extended to normative comparisons. In order to promote user-friendliness, we implemented Holm normative comparisons in a user-friendly Normative Comparisons website (Agelink van Rentergem & Huizenga, 2016).

The second new procedure is based on one-step resampling (Blakesley et al., 2009; Nichols & Holmes, 2002, for a general introduction; Westfall & Young, 1993, for a more specific treatment). In the group means testing context, it has been shown that one-step resampling controls familywise error and outperforms Bonferroni in terms of sensitivity if test scores are correlated. Up to now, one-step resampling has only been applied in the group means testing context, but we will show that it can easily be extended to normative comparisons.

The third new procedure is based on step-down resampling (Westfall & Young, 1993). In the group means testing context, it has been shown that step-down resampling controls familywise error and outperforms one-step resampling in terms of sensitivity. We will again show that generalization to the normative comparisons context is easy.

With respect to the user-friendliness of the resampling approaches, two important issues deserve attention. First, the resampling normative comparisons procedures require experience with programming, for example, in R (R Core Team, 2015), and are therefore not user-friendly. To address this, we implemented them in the user-friendly Normative Comparisons website (Agelink van Rentergem & Huizenga, 2016). A second issue relates to the fact that resampling normative comparisons require access to raw normative data; means and standard deviations of normative data are not sufficient. Raw normative data are generally available in research settings, as scientific studies often compare patient samples to healthy control samples. However, in neuropsychological practice, raw normative data are usually unavailable. To address this issue, we aggregated healthy control data from neuropsychological scientific studies into a single database. This database will be made available, without any costs, to qualified² neuropsychologists in the very near future (ANDI; Advanced Neuropsychological Diagnostics Infrastructure, 2016). Currently, investigators of 90 studies have donated healthy control data from over 25,000 participants, together completing over 50 neuropsychological tests. This offers the possibility to provide the normative data required for resampling normative comparisons.

We first outline the new normative comparison procedures in more detail. We then report the results of a Monte Carlo simulation study in which we compared the usual uncorrected and Bonferroni normative comparisons to the new Holm, one-step resampling, and step-down resampling normative comparisons. In these simulations we assess familywise false-positive error and the sensitivity to detect genuine deviations from the norm. We also illustrate the normative comparisons website with an application to the neuropsychological evaluation of patients with Parkinson disease (Muslimovic et al., 2005). Finally, we summarize results and discuss potential limitations and solutions.

Method

We first describe a single normative comparison and then proceed with Bonferroni, Holm, one-step resampling, and step-down resampling. More detail and computer code are given in the Appendix.

Normative comparisons: Single neuropsychological test

First, consider a single neuropsychological test used to compare an individual to a normative sample of N persons. Let x denote the score of the individual, and let y_n, with n = 1, …, N, denote scores in the normative sample. It is convenient (cf. Appendix) to center normative scores and the individual’s score at the normative sample mean ȳ. That is, y*_n = y_n − ȳ, and x* = x − ȳ, where * denotes that a variable is centered. The statistic required for a single normative comparison, t_norm, equals (Crawford, Howell, & Garthwaite, 1998; Sokal & Rohlf, 1995):

t_norm = (x* − ȳ*) / (c · s_y* / √N)    (1)

Note that ȳ* equals zero due to centering. In equation (1), s_y* denotes the usual estimate of the standard deviation of y*:

s_y* = √[ Σ_{n=1}^{N} (y*_n − ȳ*)² / (N − 1) ]    (2)

The scaling factor c equals √(N + 1). To understand why this is the case, suppose first that it instead equals 1. In that case, equation (1) is the common one-sample t statistic t_mean, used to test whether x* differs from the mean ȳ*. More specifically, in t_mean, x* − ȳ* is divided by the standard deviation of the mean ȳ*, that is, by its standard error s_y* / √N:

t_mean = (x* − ȳ*) / (s_y* / √N)    (3)

However, in the current normative comparisons context, we do not aim to test whether x* deviates from the mean ȳ*, but to test whether it deviates from the distribution of y*. Therefore x* should not be divided by the standard deviation of the mean ȳ*, but by the standard deviation of the distribution of y*, that is, by s_y*. This is effectuated by setting the scaling factor c in equation (1) roughly equal to √N instead of 1. More precisely, it should equal √(N + 1) (for an extensive treatment: Sokal & Rohlf, 1995, pp. 227–228).

Whereas t_mean is used to determine whether a value deviates from the mean of a distribution of observations (group means testing context), t_norm is used to determine whether a value deviates from the distribution of observations itself (normative comparisons context). In both contexts, the statistics t_mean and t_norm have to be compared to the distribution of t under the null hypothesis (Crawford et al., 1998). This is the Student t distribution with N − 1 degrees of freedom. So, if we aim to determine whether a score deviates from the norm, we compare the statistic t_norm to the distribution of t under the null hypothesis, and the resulting p-value is indicative of the abnormality of x*. If t_norm is located in the outer tails of this distribution, we decide that the score deviates from the norm. The choice of a critical value for the outer tails determines the false-positive rate.³ For example, in the case of a one-sided normative comparison, testing the hypothesis that an individual scores less than the norm, a critical value of .05 for the lower tail yields a false-positive rate of 5%.
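To make equation (1) concrete, the following is a minimal R sketch of a single one-sided normative comparison; the function name and example data are ours, not part of the original procedures:

# Minimal sketch of a single normative comparison, equation (1).
# x: the individual's raw score; y: raw scores in the normative sample.
single_normative_comparison <- function(x, y) {
  N     <- length(y)
  xstar <- x - mean(y)                           # centered score x*
  tnorm <- xstar / (sd(y) * sqrt((N + 1) / N))   # equation (1), with ybar* = 0
  p     <- pt(tnorm, df = N - 1)                 # one-sided: worse than the norm
  c(t = tnorm, p = p)
}

set.seed(123)
single_normative_comparison(x = -2, y = rnorm(50))  # hypothetical data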

This close resemblance between group means testing and normative comparisons—statistics differ by a scaling factor but the required distribution is the same—allows us to extend procedures from a group means testing context to a normative comparisons context, as is outlined next.

Bonferroni normative comparisons

If a familywise error of 5% is desired and M neuropsychological tests are administered, the p-values (cf. the previous section) of all M t_norm statistics are multiplied by M. This yields the Bonferroni corrected p-values.

Holm normative comparisons

The Holm procedure (cf. Holm, 1979, for the group means testing context) is a so-called step-down version of the Bonferroni procedure. Correction proceeds in two steps: from p-values to step-down p-values, and from step-down p-values to corrected p-values. First, the p-value of the largest absolute t_norm statistic is multiplied by M, that of the second largest by (M − 1), and so on. This yields step-down p-values. Thereafter, a correction is applied, ensuring that smaller absolute t statistics do not have smaller p-values than larger absolute t statistics. To accomplish this, the corrected p-value of a statistic is the maximum of its step-down p-value and the corrected p-values of larger absolute statistics.
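Both the Bonferroni and Holm corrections are available in base R through p.adjust, which for method = "holm" also applies the monotonicity correction described above; a minimal sketch with hypothetical p-values:

p <- c(.001, .004, .020, .030, .200)   # hypothetical per-test p-values
pmin(p * length(p), 1)                 # Bonferroni by hand: multiply by M, cap at 1
p.adjust(p, method = "bonferroni")     # same result
p.adjust(p, method = "holm")           # Holm step-down version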

One-step resampling normative comparisons

In uncorrected comparisons, the t_norm statistic is compared to the distribution of t under the null hypothesis. In one-step resampling normative comparisons, the absolute t_norm statistic is compared to the distribution of the maximum over M absolute t_norm statistics under the null hypothesis (cf. Nichols & Holmes, 2002; Westfall & Young, 1993, for the group means testing context). Whereas the distribution of t under the null hypothesis is known (the Student t distribution), the distribution of the maximum over M absolute statistics, the so-called max distribution, is unknown and therefore has to be obtained by resampling (cf. Nichols & Holmes, 2002). That is, by resampling the original dataset it is possible (cf. Appendix) to create a new dataset that satisfies the null hypothesis of no deviation from the norm on any of the M neuropsychological tests. From this new dataset, we determine and store the maximum over its M absolute statistics. This resampling procedure is repeated many—for example, 2000—times, thereby generating 2000 maximum absolute statistics under the null hypothesis and thus the required max distribution (cf. Figure 1). After this max distribution has been obtained, each of the M absolute statistics is compared to the max distribution. A more technical description is given in the Appendix.

Figure 1. An illustration of the one-step resampling approach. This figure contains max distributions obtained in a condition where the normative sample consists of 50 participants and where 13 uncorrelated tests (top row) or 13 correlated tests (bottom row) have been administered. In each row, the three figures refer to max distributions derived from 10, 100, and 1000 resamples: It can be seen that smoothness of the distribution increases with an increasing number of resamples. The theoretical Student-t distribution is depicted in the max distribution derived from 1000 resamples. If tests are uncorrelated, Bonferroni and resampling critical values (arrows) are equal. If tests are correlated, the resampling critical value is less stringent.


Step-down resampling normative comparisons

In step-down resampling normative comparisons (Westfall & Young, 1993, for the group means testing context), the largest absolute statistic is compared to the max distribution over all M neuropsychological tests, as in the one-step resampling procedure. However, the next largest absolute statistic is referred to the max distribution computed from all neuropsychological tests except the one giving rise to the largest absolute statistic. The third largest statistic is referred to the max distribution computed from all neuropsychological tests except the first two, and so on. Afterwards, a correction is applied, ensuring that smaller absolute t statistics do not have smaller p-values than larger absolute t statistics, akin to the correction used in the Holm procedure. Please refer to the Appendix for the technical description.

Monte Carlo simulations

The goal of the simulations was to assess familywise error (i.e., the complement of specificity) and sensitivity for uncorrected, Bonferroni, Holm, one-step, and step-down resampling comparisons. In the resampling comparisons we derived the max distributions by computing 2000 resamples.

Simulation method

We simulated multivariate normally distributed data for 50 persons as the normative sample, and data for one individual who was compared to this normative sample (cf., for a similar approach, Crawford & Garthwaite, 2006; Huizenga et al., 2007). This procedure was repeated 5000 times in each simulation condition. We crossed three factors. First, we included conditions with either 10 or 30 neuropsychological tests. Second, we included conditions in which correlations between tests in the normative sample were set to .0, .5, or .8. Third, we simulated a difference from the norm by giving the individual a score of 0, 2, 2.5, 3, 3.5, or 4 standard deviations from the normative data mean. In the case of a difference from the norm, this difference was present on the first five neuropsychological tests. For example, in the case of 30 tests, a difference—for example, of 3 standard deviations—was present on the first five tests, but not on the remaining 25.
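As an illustration of this setup, a single simulation draw can be generated in R with MASS::mvrnorm. The sketch below assumes an equicorrelated covariance structure; the variable names are ours rather than the authors' actual simulation code:

library(MASS)  # for mvrnorm

M     <- 10                                   # number of tests
rho   <- .5                                   # correlation between tests
delta <- 3                                    # deficit in SD units
Sigma <- matrix(rho, M, M); diag(Sigma) <- 1  # equicorrelation matrix

norms      <- mvrnorm(n = 50, mu = rep(0, M), Sigma = Sigma)  # normative sample
individual <- mvrnorm(n = 1,
                      mu = c(rep(-delta, 5), rep(0, M - 5)),  # deficit on first 5 tests
                      Sigma = Sigma)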

The normative comparisons procedures were implemented as outlined in the R code in the Appendix, with 2000 resamples for one-step and step-down resampling.

An estimate of familywise error was obtained from conditions in which there was no simulated difference between the individual and the normative sample. Familywise error was defined as the percentage of simulations in which one or more of the test results indicated a deviation from the norm. An estimate of sensitivity was obtained from conditions in which there was a simulated difference. Sensitivity was defined as the percentage of simulations in which the individual deviated on the first test.

Simulation results

Table 1 indicates that familywise error differs markedly between uncorrected comparisons and the other types of comparisons. Uncorrected comparisons are characterized by a familywise error that is far too high. In the worst case, in which 30 uncorrelated tests are administered, it is nearly 80% instead of the intended 5%. Although familywise error decreases with fewer tests and with higher correlations between them, the most favorable condition—that is, 10 tests that are .8 correlated—still yields a familywise error of about 15%. Bonferroni and Holm comparisons are characterized by a familywise error at or below 5%. One-step and step-down resampling comparisons always have a familywise error of about 5%. So Bonferroni, Holm, and one-step and step-down resampling, but not the usual uncorrected comparisons, keep familywise error at or below 5%.

Table 1. Familywise error rate as a function of the number of neuropsychological tests and correlations between these tests in the normative sample.

Sensitivity is depicted in Figure 2. Although uncorrected comparisons are characterized by an unacceptably large familywise error, their results are plotted to provide an upper bound on attainable sensitivity. First consider the situation in which test scores are uncorrelated (left-hand panels). In these cases all procedures have equal sensitivity. Second, if variables are correlated (middle and right-hand panels), resampling comparisons are characterized by the highest sensitivity, with step-down resampling slightly outperforming one-step resampling.

Figure 2. Sensitivity as a function of correlations in the normative sample and as a function of the magnitude of the simulated differences to the norm.


In sum, among the procedures with an acceptable familywise error, step-down resampling is to be preferred, as it has the highest sensitivity. As compared to Bonferroni, a sensitivity advantage of up to 20% can be attained.

Illustrative application

Muslimovic et al. (2005) compared the cognitive profile of 115 patients with newly diagnosed Parkinson disease to that of 70 healthy controls. As an illustration we compare each of these patients to the control sample using the common uncorrected and Bonferroni normative comparisons, and the new Holm, one-step resampling, and step-down resampling normative comparisons.

Twenty-three neuropsychological test variables were included in the analysis. Only participants with complete data were included, leaving 84 patients and 65 controls for further analysis. The patient and control samples differed significantly in age; therefore we used scores that were standardized with respect to published norms, or that were standardized by means of a regression approach (for further details on standardization: Muslimovic et al., 2005). All normative comparisons were one-sided, because we hypothesized that patients perform worse than the control sample. We required that individual scores were located below the usual 5th percentile—that is, we used the alpha = .05 criterion.

The average correlation between variables was not very high (.15), but some variables correlated in the .6–.9 range (cf. Figure 3). Therefore we expected the resampling approaches, as compared to Bonferroni and Holm, to show a higher percentage of deviations.

Figure 3. Histogram of correlations between normative test scores in the empirical illustration.


Uncorrected comparisons reveal that 89% of the newly diagnosed Parkinson patients show a deviation on at least one neuropsychological test variable. This percentage is 17% for Bonferroni and Holm and 19% for one-step and step-down resampling. Two patients are not classified as deviating with Bonferroni and Holm, but are with the resampling approaches.

As an illustration, consider how one of these patients, patient 3075, is analyzed with the Normative Comparisons website (Figure 4). Input options are displayed on the left, whereas output, both in graphical form (Figure 4, upper panel) and in tabular form (Figure 4, lower panel), is displayed on the right. With respect to input, we uploaded two datasets, one for controls and one for patients, containing ID numbers and test scores. We also selected the type of normative comparison: step-down resampling, one-sided comparisons, deciding whether scores are lower than the norm, with the usual alpha = .05 criterion. The graphical output (Figure 4, upper panel) and matching tabular output (Figure 4, lower panel) indicate that this patient deviates on the Tower of London test, but not on the other tests.

Figure 4. Illustration of the Normative Comparisons website. To view a color version of this figure, please see the online issue of the Journal.


Discussion

Large battery normative comparisons are ubiquitous in neuropsychological practice and research. Therefore, it is important that these comparisons are carried out in a valid, sensitive, and user-friendly way. First, adequate large battery normative comparisons should control the familywise false-positive error rate at a prespecified level in order to guarantee high specificity. Second, they should have sufficient sensitivity to detect genuine deviations from the norm. Third, they should be user-friendly to allow routine application in neuropsychological practice and research. We noted that several procedures for overall normative comparisons satisfy these three criteria, but that current standard procedures for test-specific comparisons do not. Therefore, the aim of the current paper was to develop test-specific normative comparisons procedures meeting all three criteria. We compared these new procedures to standard procedures by means of simulations and by means of an empirical example.

Results of our simulation study indicate that traditional uncorrected comparisons do not control familywise false-positive error. In the worst case, a familywise error approaching 80% instead of the intended 5% was observed. Only the Bonferroni, Holm, one-step resampling, and step-down resampling procedures control familywise error at or below 5%. Resampling outperforms Bonferroni in terms of sensitivity, with a slight advantage of step-down resampling over one-step resampling. Our simulations indicate that a sensitivity advantage of up to 20% over Bonferroni can be obtained. Let us suppose that the sensitivity advantage is 10%. This implies that an additional 10 out of 100 individuals will be correctly characterized as deviating from the norm. In neuropsychological practice, these individuals may then, for example, profit from interventions, which otherwise would not be available to them. In neuropsychological research, this heightened sensitivity will offer the opportunity to gain more insight into the mechanisms underlying a disorder (Lezak et al., 2012).

The increase in sensitivity as compared to Bonferroni depends on the magnitude of correlations between neuropsychological tests. It is difficult to give a general indication of the sensitivity advantage that is to be expected in neuropsychology, since the magnitude of correlations is unknown in most situations. In our Parkinson example, correlations varied between –.20 and .80, and the average correlation was .15. Although the average correlation was small, the resampling methods did classify two individuals as deviating whom Bonferroni did not.

Several issues deserve attention. First, we only investigated performance of procedures for normally distributed normative data. Crawford, Garthwaite, Azzalini, Howell, and Laws (2006) indicated that the t-statistic approach, which lies at the heart of uncorrected, Bonferroni, and Holm normative comparisons, is affected by non-normality (cf. Grasman et al., 2010). Resampling approaches to mean testing are generally less affected by non-normality than t tests. Therefore, resampling approaches to normative comparisons might also be beneficial in this respect, yet this requires further investigation.

Second, base rates of impairment may vary between patient samples. The current resampling procedures may accommodate such base rate information in two ways. First, base rates may be included as priors in a Bayesian approach. Ibrahim, Chen, and Gray (2002) proposed a Bayesian extension of the one-step resampling approach in a group means testing context. An extension to the current normative comparisons context might therefore be feasible. Note, however, that a Bayesian approach is hardly ever used in neuropsychological practice (Elwood, 2007; Gavett, 2015). Instead, neuropsychologists include base rates informally by using lenient cutoff criteria—for example, by choosing a nominal alpha of 20% instead of 5% (Elwood, 2007; Meehl & Rosen, 1955). Accordingly, the second way to incorporate base rate information into resampling procedures is to use such lenient cutoff criteria.

Third, and related to the previous point, as compared to the usual uncorrected comparisons, Bonferroni, Holm, one-step resampling, and step-down resampling comparisons are characterized by decreased sensitivity. If high sensitivity is required, we argue that it is still better not to use uncorrected comparisons, as these do not provide insight into the actual familywise error. In such circumstances, we suggest using a more lenient criterion—for example, changing the required familywise error from 5% to 10% or 20%. For example, if an effective and safe treatment for cognitive impairment were available, up to 20% false positives might be preferred to minimize the risk that patients are denied access to this effective treatment.

To conclude, the present study indicates that large battery test-specific normative comparisons are best carried out by resampling normative comparisons, especially by step-down resampling comparisons. They control familywise error, they have the highest sensitivity to detect genuine deviations, and they are user-friendly, since the Normative Comparisons website promotes their routine use in neuropsychological research and clinical practice.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

H.M.H. is supported by a VICI grant awarded by the Netherlands Organization of Scientific Research (NWO) [grant number 453-12-005]. J.A.R. is supported by a grant awarded by the NWO [grant number MaGW 480-12-015]. D.M. was supported by the Prinses Beatrix Fonds [grant number PGO01-0138]. R.P.P.P.G. was supported by a VENI grant awarded by the NWO [grant number C.2523.0079].

Notes

1 Although we adhere in this paper to a required false-positive rate of 5%, other prespecified rates may also be imposed, without loss of generality.

2 In the first year after release of the database, Dutch qualified neuropsychologists will be given access. Qualifications can be checked easily, as every licensed neuropsychologist is registered by the Dutch Ministry of Health (BIG register, 2016). After this first year, international extensions will be considered.

3 Provided that the usual assumptions of a t test are met.

References

  • Advanced Neuropsychological Diagnostics Infrastructure. (2016, January 21). Retrieved from http://www.andi.nl/home/
  • Agelink van Rentergem, J. A., & Huizenga, H. M. (2016, January 21). Retrieved from https://eclip.shinyapps.io/NormativeComparisons
  • Anderson, M. J., & Legendre, P. (1999). An empirical comparison of permutation methods for tests of partial regression coefficients in a linear model. Journal of Statistical Computation and Simulation, 62(3), 271–303. doi:10.1080/00949659908811936
  • Arenas-Pinto, A., Winston, A., Stöhr, W., Day, J., Wiggins, R., Quah, S. P.,… Paton, N. I. (2014). Neurocognitive function in HIV-infected patients: Comparison of two methods to define impairment. PloS One, 9(7), e103498. doi:10.1371/journal.pone.0103498
  • Axelrod, B. N., & Wall, J. R. (2007). Expectancy of impaired neuropsychological test scores in a non-clinical sample. The International Journal of Neuroscience, 117(11), 1591–1602. doi:10.1080/00207450600941189
  • Bell, M. L., Olivier, J., & King, M. T. (2013). Scientific rigour in psycho-oncology trials: Why and how to avoid common statistical errors. Psycho-Oncology, 22(5), 499–505.
  • Berthelson, L., Mulchan, S. S., Odland, A. P., Miller, L. J., & Mittenberg, W. (2013). False positive diagnosis of malingering due to the use of multiple effort tests. Brain Injury, 27(7–8), 909–916. doi:10.3109/02699052.2013.793400
  • BIG register. (2016, March 16). Retrieved from https://www.bigregister.nl/en/
  • Bilder, R. M., Sugar, C. A., & Hellemann, G. S. (2014). Cumulative false positive rates given multiple performance validity tests: Commentary on Davis and Millis (2014) and Larrabee (2014). The Clinical Neuropsychologist, 28(8), 1212–1223. doi:10.1080/13854046.2014.969774
  • Binder, L. M., Iverson, G. L., & Brooks, B. L. (2009). To err is human: “Abnormal” neuropsychological scores and variability are common in healthy adults. Archives of Clinical Neuropsychology : The Official Journal of the National Academy of Neuropsychologists, 24(1), 31–46. doi:10.1093/arclin/acn001
  • Blackford, R. C., & La Rue, A. (1989). Criteria for diagnosing age-associated memory impairment: Proposed improvements from the field. Developmental Neuropsychology, 5(4), 295–306. doi:10.1080/87565648909540440
  • Blakesley, R. E., Mazumdar, S., Dew, M. A., Houck, P. R., Tang, G., Reynolds, C. F., & Butters, M. A. (2009). Comparisons of methods for multiple hypothesis testing in neuropsychological research. Neuropsychology, 23(2), 255–264. doi:10.1037/a0012850
  • Brooks, B. (2010). Seeing the forest for the trees: Prevalence of low scores on the Wechsler Intelligence Scale for Children, fourth edition (WISC–IV). Psychological Assessment, 22(3), 650–656. doi:10.1037/a0019781
  • Brooks, B. L., Iverson, G. L., Feldman, H. H., & Holdnack, J. A. (2009). Minimizing misdiagnosis: Psychometric criteria for possible or probable memory impairment. Dementia and Geriatric Cognitive Disorders, 27(5), 439–450. doi:10.1159/000215390
  • Brooks, B. L., Iverson, G. L., Holdnack, J. A., & Feldman, H. H. (2008). Potential for misclassification of mild cognitive impairment: A study of memory scores on the Wechsler Memory Scale–III in healthy older adults. Journal of the International Neuropsychological Society: JINS, 14(3), 463–478. doi:10.1017/S1355617708080521
  • Cohen, S., Ter Stege, J. A., Geurtsen, G. J., Scherpbier, H. J., Kuijpers, T. W., Reiss, P., … Pajkrt, D. (2015). Poorer cognitive performance in perinatally HIV-infected children versus healthy socioeconomically matched controls. Clinical Infectious Diseases: An Official Publication of the Infectious Diseases Society of America, 60(7), 1111–1119. doi:10.1093/cid/ciu1144
  • Constantinidou, F., Wertheimer, J. C., Tsanadis, J., Evans, C., & Paul, D. R. (2012). Assessment of executive functioning in brain injury: Collaboration between speech-language pathology and neuropsychology for an integrative neuropsychological perspective. Brain Injury, 26(13–14), 1549–1563. doi:10.3109/02699052.2012.698786
  • Crawford, J., & Allan, K. (1994). The Mahalanobis Distance index of WAIS–R subtest scatter: Psychometric properties in a healthy UK sample. British Journal of Clinical Psychology, 33, 65–69. Retrieved from http://onlinelibrary.wiley.com/doi/10.1111/j.2044-8260.1994.tb01094.x/full
  • Crawford, J. R. (2016, January 21). Estimating percentage of population exhibiting abnormally low scores and score differences. Retrieved from http://homepages.abdn.ac.uk/j.crawford/pages/dept/PercentAbnormKtests.htm
  • Crawford, J. R., & Garthwaite, P. H. (2002). Investigation of the single case in neuropsychology: Confidence limits on the abnormality of test scores and test score differences. Neuropsychologia, 40(8), 1196–1208. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/11931923
  • Crawford, J. R., & Garthwaite, P. H. (2006). Comparing patients’ predicted test scores from a regression equation with their obtained scores: A significance test and point estimate of abnormality with accompanying confidence limits. Neuropsychology, 20(3), 259–271. doi:10.1037/0894-4105.20.3.259
  • Crawford, J. R., Garthwaite, P. H., Azzalini, A., Howell, D. C., & Laws, K. R. (2006). Testing for a deficit in single-case studies: Effects of departures from normality. Neuropsychologia, 44(4), 666–677. doi:10.1016/j.neuropsychologia.2005.06.001
  • Crawford, J. R., Garthwaite, P. H., & Gault, C. B. (2007). Estimating the percentage of the population with abnormally low scores (or abnormally large score differences) on standardized neuropsychological test batteries: A generic method with applications. Neuropsychology, 21(4), 419–430. doi:10.1037/0894-4105.21.4.419
  • Crawford, J. R., & Howell, D. C. (1998). Regression equations in clinical neuropsychology: An evaluation of statistical methods for comparing predicted and obtained scores. Journal of Clinical and Experimental Neuropsychology, 20(5), 755–762. doi:10.1076/jcen.20.5.755.1132
  • Crawford, J. R., Howell, D. C., & Garthwaite, P. H. (1998). Payne and Jones revisited: Estimating the abnormality of test score differences using a modified paired samples t test. Journal of Clinical and Experimental Neuropsychology, 20(6), 898–905. doi:10.1076/jcen.20.6.898.1112
  • Davis, J. J., & Millis, S. R. (2014). Examination of performance validity test failure in relation to number of tests administered. The Clinical Neuropsychologist, 28(2), 199–214. doi:10.1080/13854046.2014.884633
  • Eichstaedt, K. E., Kovatch, K., & Maroof, D. A. (2013). A less conservative method to adjust for familywise error rate in neuropsychological research: The Holm’s sequential Bonferroni procedure. NeuroRehabilitation, 32(3), 693–696. doi:10.3233/NRE-130893
  • Elwood, R. W. (2007). Clinical discriminations and neuropsychological tests: An appeal to Bayes’ theorem. Clinical Neuropsychologist, 7(2), 224–233. doi:10.1080/13854049308401527
  • Gavett, B. E. (2015). The value of Bayes’ theorem for interpreting abnormal test scores in cognitively healthy and clinical samples. Journal of the International Neuropsychological Society, 21(3), 249–257. doi:10.1017/S1355617715000168
  • Gisslén, M., Price, R. W., & Nilsson, S. (2011). The definition of HIV-associated neurocognitive disorders: Are we overestimating the real prevalence? BMC Infectious Diseases, 11(1), 356. doi:10.1186/1471-2334-11-356
  • González-Redondo, R., Toledo, J., Clavero, P., Lamet, I., García-García, D., García-Eulate, R.,… Rodríguez-Oroz, M. C. (2012). The impact of silent vascular brain burden in cognitive impairment in Parkinson’s disease. European Journal of Neurology: The Official Journal of the European Federation of Neurological Societies, 19(8), 1100–1107. doi:10.1111/j.1468-1331.2012.03682.x
  • Grasman, R. P. P. P., Huizenga, H. M., & Geurts, H. M. (2010). Departure from normality in multivariate normative comparison: The Cramér alternative for Hotelling’s T2. Neuropsychologia, 48(5), 1510–1516. doi:10.1016/j.neuropsychologia.2009.11.016
  • Grunseit, A. C., Perdices, M., Dunbar, N., & Cooper, D. A. (1994). Neuropsychological function in asymptomatic HIV-1 infection: Methodological issues. Journal of Clinical and Experimental Neuropsychology, 16(6), 898–910. doi:10.1080/01688639408402701
  • Höfler, M. (2005). The effect of misclassification on the estimation of association: A review. International Journal of Methods in Psychiatric Research, 14(2), 92–101. Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/mpr.20/abstract
  • Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70. doi:10.2307/4615733
  • Huba, G. J. (1985). How unusual is a profile of test scores? Journal of Psychoeducational Assessment, 3(4), 321–325.
  • Huberty, C. J., & Morris, J. D. (1989). Multivariate analysis versus multiple univariate analyses. Psychological Bulletin, 105(2), 302–308. doi:10.1037//0033-2909.105.2.302
  • Huizenga, H. M., Smeding, H., Grasman, R. P. P. P., & Schmand, B. (2007). Multivariate normative comparisons. Neuropsychologia, 45(11), 2534–2542. doi:10.1016/j.neuropsychologia.2007.03.011
  • Ibrahim, J. G., Chen, M.-H., & Gray, R. J. (2002). Bayesian models for gene expression with DNA microarray data. Journal of the American Statistical Association, 97(457), 88–99. doi:10.1198/016214502753479257
  • Ingraham, L. J., & Aiken, C. B. (1996). An empirical approach to determining criteria for abnormality in test batteries with multiple measures. Neuropsychology, 10(1), 120–124. doi:10.1037//0894-4105.10.1.120
  • Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. doi:10.1371/journal.pmed.0020124
  • Kazdin, A. E. (2008). Evidence-based treatment and practice: New opportunities to bridge clinical research and practice, enhance the knowledge base, and improve patient care. The American Psychologist, 63(3), 146–159. doi:10.1037/0003-066X.63.3.146
  • Kraemer, H. C., & Kupfer, D. J. (2006). Size of treatment effects and their importance to clinical research and practice. Biological Psychiatry, 59(11), 990–996. doi:10.1016/j.biopsych.2005.09.014
  • Larrabee, G. J. (2008). Aggregation across multiple indicators improves the detection of malingering: Relationship to likelihood ratios. The Clinical Neuropsychologist, 22(4), 666–679. doi:10.1080/13854040701494987
  • Larrabee, G. J. (2014). False-positive rates associated with the use of multiple performance and symptom validity tests. Archives of Clinical Neuropsychology: The Official Journal of the National Academy of Neuropsychologists, 29(4), 364–373. doi:10.1093/arclin/acu019
  • Levav, M., Mirsky, A. F., Herault, J., Xiong, L., Amir, N., & Andermann, E. (2002). Familial association of neuropsychological traits in patients with generalized and partial seizure disorders. Journal of Clinical and Experimental Neuropsychology, 24(3), 311–326. doi:10.1076/jcen.24.3.311.985
  • Lewis, M. S., Maruff, P., Silbert, B. S., Evered, L. A., & Scott, D. A. (2006). Detection of postoperative cognitive decline after coronary artery bypass graft surgery is affected by the number of neuropsychological tests in the assessment battery. The Annals of Thoracic Surgery, 81(6), 2097–2104. doi:10.1016/j.athoracsur.2006.01.044
  • Lezak, M. D., Howieson, D. B., Bigler, E. D., & Tranel, D. (2012). Neuropsychological assessment (5th ed.). New York, NY: Oxford University Press.
  • Loewenstein, D., Acevedo, A., Ownby, R., Agron, J., Barker, W., Isaacson, R.,… Duara, R. (2006). Using different memory cutoffs to assess mild cognitive impairment. The American Journal of Geriatric Psychiatry, 14(11), 911–919. Retrieved from http://www.sciencedirect.com/science/article/pii/S1064748112608707
  • Meehl, P. E., & Rosen, A. (1955). Antecedent probability and the efficiency of psychometric signs, patterns, or cutting scores. Psychological Bulletin, 52(3), 194–216. doi:10.1037/h0048070
  • Meyer, A.-C. L., Boscardin, W. J., Kwasa, J. K., & Price, R. W. (2013). Is it time to rethink how neuropsychological tests are used to diagnose mild forms of HIV-associated neurocognitive disorders? Impact of false-positive rates on prevalence and power. Neuroepidemiology, 41(3–4), 208–216. doi:10.1159/000354629
  • Meyers, J. E., Miller, R. M., Thompson, L. M., Scalese, A. M., Allred, B. C., Rupp, Z. W.,… Junghyun Lee, A. (2014). Using likelihood ratios to detect invalid performance with performance validity measures. Archives of Clinical Neuropsychology: The Official Journal of the National Academy of Neuropsychologists, 29(3), 224–235. doi:10.1093/arclin/acu001
  • Miguel, E., Camerer, C., Casey, K., Cohen, J., Esterling, K., Gerber, A.,… van der Laan, M. (2014). Promoting transparency in social science research. Science, 343, 30–31. Retrieved from http://e-gap.org/wp/wp-content/uploads/2014/04/Transparency-UCB_2014-04-09.pdf
  • Multivariate normative comparisons. (2016, January 21). Retrieved from http://purl.org/net/rgrasman/mnc
  • Muslimovic, D., Post, B., Speelman, J. D., & Schmand, B. (2005). Cognitive profile of patients with newly diagnosed Parkinson disease. Neurology, 65(8), 1239–1245. doi:10.1212/01.wnl.0000180516.69442.95
  • Naglieri, J. A., & Paolitto, A. W. (2005). Ipsative comparisons of WISC–IV index scores. Applied Neuropsychology, 12(4), 208–211. doi:10.1207/s15324826an1204
  • Nichols, T. E., & Holmes, A. P. (2002). Nonparametric permutation tests for functional neuroimaging: A primer with examples. Human Brain Mapping, 15(1), 1–25. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/11747097
  • Palmer, B. W., Boone, K. B., Lesser, I. M., & Wohl, M. A. (1998). Base rates of “impaired” neuropsychological test performance among healthy older adults. Archives of Clinical Neuropsychology, 13(6), 503–511. doi:10.1093/arclin/13.6.503
  • Proto, D. A., Pastorek, N. J., Miller, B. I., Romesser, J. M., Sim, A. H., & Linck, J. F. (2014). The dangers of failing one or more performance validity tests in individuals claiming mild traumatic brain injury-related postconcussive symptoms. Archives of Clinical Neuropsychology: The Official Journal of the National Academy of Neuropsychologists, 29(7), 614–624. doi:10.1093/arclin/acu044
  • R Core Team. (2015). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/
  • Sander, A., Nakase-Richardson, R., Constantinidou, F., Wertheimer, J., & Paul, D. R. (2007). Memory assessment on an interdisciplinary rehabilitation team: A theoretically based framework. American Journal of Speech-Language Pathology, 16, 316–331. Retrieved from http://ajslp.pubs.asha.org/article.aspx?articleid=1757586
  • Schagen, S. B., Muller, M. J., Boogerd, W., Mellenbergh, G. J., & van Dam, F. S. A. M. (2006). Change in cognitive function after chemotherapy: A prospective longitudinal study in breast cancer patients. Journal of the National Cancer Institute, 98(23), 1742–1745. doi:10.1093/jnci/djj470
  • Schatz, P., Jay, K. A., McComb, J., & McLaughlin, J. R. (2005). Misuse of statistical tests in Archives of Clinical Neuropsychology publications. Archives of Clinical Neuropsychology: The Official Journal of the National Academy of Neuropsychologists, 20(8), 1053–1059. doi:10.1016/j.acn.2005.06.006
  • Schretlen, D. J., Testa, S. M., Winicki, J. M., Pearlson, G. D., & Gordon, B. (2008). Frequency and bases of abnormal performance by healthy adults on neuropsychological testing. Journal of the International Neuropsychological Society: JINS, 14(3), 436–445. doi:10.1017/S1355617708080387
  • Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. doi:10.1177/0956797611417632
  • Smeding, H. M. M., Speelman, J. D., Huizenga, H. M., Schuurman, P. R., & Schmand, B. (2011). Predictors of cognitive and psychosocial outcome after STN DBS in Parkinson’s disease. Journal of Neurology, Neurosurgery, and Psychiatry, 82(7), 754–760. doi:10.1136/jnnp.2007.140012
  • Sokal, R. R., & Rohlf, J. F. (1995). Biometry. San Francisco, CA: W. H. Freeman.
  • Su, T., Schouten, J., Geurtsen, G. J., Wit, F. W., Stolte, I. G., Prins, M.,… Schmand, B. A. (2015). Multivariate normative comparison, a novel method for more reliably detecting cognitive impairment in HIV infection. AIDS, 29(5), 547–557. doi:10.1097/QAD.0000000000000573
  • Torti, C., Focà, E., Cesana, B. M., & Lescure, F. X. (2011). Asymptomatic neurocognitive disorders in patients infected by HIV: Fact or fiction? BMC Medicine, 9(1), 138. doi:10.1186/1741-7015-9-138
  • Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment (Vol. 279). Hoboken, NJ: John Wiley & Sons.
  • Wilson, C. E., Happé, F., Wheelwright, S. J., Ecker, C., Lombardo, M. V., Johnston, P., … Murphy, D. G. M. (2014). The neuropsychology of male adults with high-functioning autism or Asperger syndrome. Autism Research: Official Journal of the International Society for Autism Research, 7(5), 568–581. doi:10.1002/aur.1394
  • Wilson, S. J., Baxendale, S., Barr, W., Hamed, S., Langfitt, J., Samson, S., … Smith, M.-L. (2015). Indications and expectations for neuropsychological assessment in routine epilepsy care: Report of the ILAE Neuropsychology Task Force, Diagnostic Methods Commission, 2013–2017. Epilepsia, 56(5), 674–681. doi:10.1111/epi.12962

Appendix

Algorithms

In this appendix we first give a more detailed description of the one-step resampling algorithm and then indicate how it is extended to the step-down resampling algorithm. R code is also provided.

One-step resampling algorithm

The one-step algorithm yields the distribution of the maximum absolute t statistic under the null hypothesis, the max distribution, by means of the so-called permutation approach to resampling (cf. Anderson & Legendre, 1999; Nichols & Holmes, 2002). Thereafter, each absolute t_norm statistic of an individual is compared to this max distribution.

Let n = 1, …, N denote the N participants in the normative sample, and let m = 1, …, M denote the M neuropsychological tests. A vector y*_m contains the N centered test scores on the mth neuropsychological test. For example, if N = 6, y*_m may equal [2, 4, –2, –4, 2, –2]—that is, on test m, the first participant in the normative sample has a centered score of 2, the second participant has a centered score of 4, and so on. The scalar x*_m denotes the individual’s centered score on the mth test. Then perform the following computations (C1–C5).

  • C1: As normative data have been centered, a resample can be obtained by multiplying each element in y*_m by a randomly chosen +1 or –1 (cf. Nichols & Holmes, 2002). For example, the centered normative data on the mth test, y*_m = [2, 4, –2, –4, 2, –2], are multiplied by a randomly generated vector [+1, –1, +1, +1, –1, +1], yielding the resampled data [2, –4, –2, –4, –2, –2]. In order to leave the correlation structure between variables intact, it is crucial that each test is multiplied by the same randomly generated vector, so y*_1, y*_2, …, y*_M are all multiplied by [+1, –1, +1, +1, –1, +1].

  • C2: Compute the t_norm statistic in this resampled dataset for each of the m = 1, …, M tests. In computing this statistic, x*_m is set to zero, as we are interested in the distribution under the null hypothesis.

  • C3: Determine the maximum over the M absolute t_norm statistics obtained in C2. This yields the max statistic. Repeat C1 to C3 several, say L, times. In the present simulation study, L is set to 2000.

  • C4: Store the L max statistics; this yields the required distribution of the maximum absolute t statistic under the null hypothesis, the max distribution.

  • C5: Determine, for each variable m, where the absolute t_norm statistic is located in the max distribution. In the case of a two-sided hypothesis, if an absolute statistic is located beyond the 95th percentile of the max distribution, this indicates that the individual deviates from the norm on that particular neuropsychological test. In the case of a one-sided hypothesis—that is, that an individual performs worse than the norm—two conditions should be satisfied: The sign of t_norm should be in the expected direction, and the absolute value of t_norm should be located beyond the 90th percentile of the max distribution. Note that the one- and two-sided critical percentiles are the 90th and 95th, and not the 95th and 97.5th, since the max distribution concerns maxima of absolute t values.
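The C1–C5 computations can be condensed into a short R function. The following is a minimal sketch, not the code used for the reported simulations: it assumes centered normative data in an N × M matrix ystar and a centered individual profile xstar of length M (both hypothetical names), and performs one-sided comparisons (worse than the norm).

t_norm <- function(x, y) {                        # equation (1) for each of M tests
  N <- nrow(y)
  (x - colMeans(y)) / (apply(y, 2, sd) * sqrt((N + 1) / N))
}

onestep_resampling <- function(xstar, ystar, L = 2000, alpha = .05) {
  N <- nrow(ystar); M <- ncol(ystar)
  # C1-C4: build the max distribution under the null hypothesis
  tmax <- replicate(L, {
    signs <- sample(c(-1, 1), N, replace = TRUE)  # C1: same sign flips for every test
    max(abs(t_norm(rep(0, M), signs * ystar)))    # C2-C3: x* set to zero
  })
  # C5: one-sided comparison against the 90th percentile of the max distribution
  t_obs <- t_norm(xstar, ystar)
  crit  <- quantile(tmax, 1 - 2 * alpha)
  list(t = t_obs, critical = crit, deviant = (t_obs < 0) & (abs(t_obs) > crit))
}

Centering can be done beforehand with, for example, sweep(norms, 2, colMeans(norms)) for the normative data and individual - colMeans(norms) for the individual’s scores.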

Step-down resampling algorithm

Absolute t_norm statistics are first ordered from high to low. The highest absolute statistic is referred to the max distribution over all M tests, as outlined above. The next highest statistic is referred to the max distribution derived over all neuropsychological tests except the test for which the highest statistic was observed. The third highest statistic is referred to the max distribution derived over all neuropsychological tests except the two tests for which the two highest statistics were observed, and so on. The p-values thus obtained are subjected to a correction, imposing that p-values of low absolute statistics can never be lower than p-values of high absolute statistics. That is, a corrected p-value is the maximum of the current p-value and the corrected p-values associated with higher absolute statistics.
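A matching R sketch of the step-down adjustment, assuming that an L × M matrix tnull of absolute null statistics (one row per resample, one column per test, from the algorithm above) has been stored rather than only its row maxima; tobs holds the M observed absolute statistics, and the final cummax implements the monotonicity correction:

stepdown_p <- function(tobs, tnull) {
  ord <- order(tobs, decreasing = TRUE)            # largest |t| first
  p   <- numeric(length(tobs))
  for (i in seq_along(ord)) {
    remaining <- ord[i:length(ord)]                # drop tests already handled
    maxdist   <- apply(tnull[, remaining, drop = FALSE], 1, max)
    p[ord[i]] <- mean(maxdist >= tobs[ord[i]])     # p from a shrinking max distribution
  }
  p[ord] <- cummax(p[ord])                         # smaller |t| never gets a smaller p
  p
}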

To view a color version of the R-code algorithm, please see the online issue of the Journal.