Original Research

National Institutes of Health Toolbox Emotion Battery for English- and Spanish-speaking adults: normative data and factor-based summary scores

Pages 115-127 | Published online: 15 Mar 2018

Abstract

Background

The National Institutes of Health Toolbox Emotion Battery (NIHTB-EB) is a “common currency”, computerized assessment developed to measure the full spectrum of emotional health. Though comprehensive, the NIHTB-EB’s 17 scales may be unwieldy for users aiming to capture more global indices of emotional functioning.

Methods

NIHTB-EB was administered to 1,036 English-speaking and 408 Spanish-speaking adults as a part of the NIH Toolbox norming project. We examined the factor structure of the NIHTB-EB in English- and Spanish-speaking adults and developed factor analysis-based summary scores. Census-weighted norms were presented for English speakers, and sample-weighted norms were presented for Spanish speakers.

Results

Exploratory factor analysis for both English- and Spanish-speaking cohorts resulted in the same 3-factor solution: 1) negative affect, 2) social satisfaction, and 3) psychological well-being. Confirmatory factor analysis supported similar factor structures for English- and Spanish-speaking cohorts. Model fit indices fell within the acceptable/good range, and our final solution was optimal compared to other solutions.

Conclusion

Summary scores based upon the normative samples appear to be psychometrically supported and should be applied to clinical samples to further validate the factor structures and investigate rates of problematic emotions in medical and psychiatric populations.

Introduction

The National Institutes of Health (NIH) Toolbox Assessment of Neurological and Behavioral Function (www.nihtoolbox.org) is a set of brief measures that assess cognitive, emotional, motor, and sensory functions across the life span. It was commissioned by the NIH Blueprint for Neuroscience Research to provide a widely accessible, easy to administer, brief method assessing multiple aspects of health in a way that can be uniform across neurological research.Citation1 Because the battery provides a “common currency” across clinical and research settings, it can be used to monitor neurological and behavioral functioning across the life span with various health conditions and their treatments. The NIH Toolbox Emotion Battery (NIHTB-EB) was created in response to consensus from an expert panel identifying the need to measure both positive and negative aspects of emotions in a standardized manner.Citation2 NIHTB-EB evolved out of the Patient-Reported Outcomes Measurement Information System (PROMIS). The PROMIS battery focused on the impact of chronic conditions on health-related quality of life (HRQL).Citation3 At the time, PROMIS included items on depression, anxiety, and emotional distress.Citation4 Recognizing the full spectrum of emotional life and its impact on health, the NIH Toolbox mandate was to develop an assessment tool with a broad focus rather than only assessing negative emotions.

Leveraging the decades of work characterizing the relationship between emotional functioning and health, an expert panel of investigators funded by the NIH identified four theoretically relevant subdomains for inclusion in the NIHTB-EB: negative affect, psychological well-being, stress and self-efficacy, and social relationships.Citation5 Specifically, given that negative and positive emotions are relatively independent of each other and not necessarily opposite extremes of one continuum,Citation6Citation11 the NIHTB-EB aimed to assess negative and positive psychological functioning separately. Additionally, there is a strong, bidirectional relationship between social relationships and emotional health;Citation12 therefore, the NIHTB-EB aimed to tap into the interpersonal aspects of everyday life, such as support and friendship. Finally, perceptions of stress and self-efficacy significantly impact physical health and mental health both directly (eg, adverse physical effects of stress-related cortisol) and indirectly (eg, selection and application of coping strategies)Citation13Citation15 and were therefore considered for inclusion in the final battery.

After the theoretically relevant domains were identified, the committee for the development of the NIHTB-EB was tasked with the selection of psychometrically sound and nonproprietary measures, as well as generation of item banks to measure each of these important constructs when an already existing measure was unavailable. Expert feedback and literature review informed the selection of the item banks for the different scales of the NIHTB-EB.Citation16 For example, the team of researchers who worked on the negative affect scales included items from the PROMIS item bank and other well-known measures specific to negative emotions.Citation6 Selections were then made on all of the items that were to be included, and these items went through extensive calibration to promote the Toolbox agenda focused on creating a useful and efficient tool to assess emotions.

Although much thought and consideration went into the selection of the items within each domain, there has not been a comprehensive study within the large normative database, examining the specific domains that the final 17 individual scales represent. There has also been no method proposed for obtaining summary scores for the respective domains. The purpose of this study was to evaluate and compare the factor structure of the NIHTB-EB scales in English- and Spanish-speaking adults through exploratory and confirmatory factor analyses, as well as to begin exploring sociodemographic effects on the battery. Our goal was to identify composite scales based on the factor analyses findings and provide formulas such that the composite measures may be implemented across research and clinical settings that utilize the NIHTB-EB. Census-weighted normative data were provided for English-speaking adults and sample-weighted norms for Spanish speakers.

Methods

Participants and procedures

The NIH Toolbox normative sample of adults consisted of healthy community-dwelling individuals 18–85 years old who were recruited across 10 testing sites using a stratified sampling strategy (strata: age, gender, primary language).Citation17 Potential study participants were randomly selected from existing databases and completed a telephone screen to determine eligibility based on sociodemographic and linguistic categories. Additional participant inclusion criteria included 1) community-dwelling and noninstitutionalized, 2) ability to follow instructions in English or Spanish, and 3) having adequate physical capability (visual, auditory, vestibular, and motor functions), either independently or with assistive devices, to complete the full Toolbox battery (including the cognition, motor, and sensory modules).Citation18 Notably, included adults were presumed to be healthy but were not explicitly screened or excluded for psychiatric history. Research associates who went through training and certification processes, overseen by a team from Northwestern University, conducted structured interviews to help identify those who could be included in the normative project. Certifiers at Northwestern University acted as site monitors and supervised all aspects of data collection, from setup to quality assurance throughout the data gathering process. This study complied with the ethical rules for human experimentation stated in the Declaration of Helsinki, with Northwestern University’s institutional review board’s approval, and written informed consent was obtained from all participants.

Participants included in the NIHTB-EB analysis were 1,036 English-speaking and 408 Spanish-speaking adults who self-identified demographic characteristics (Table 1). All participants who completed the battery in Spanish self-identified their ethnicity as Hispanic. In the English-speaking cohort, 67% identified their ethnicity as non-Hispanic White, 15% as non-Hispanic Black, 13% as Hispanic, and 5% as non-Hispanic others. Demographic comparisons between English and Spanish battery completers revealed that the Spanish sample was younger, with lower education and annual household incomes (P’s<0.001). Spanish and English speakers were comparable on gender.

Table 1 Sample characteristics of total adult sample and subsample with additional sociodemographic data (mean, SD, %)

A subset of individuals (n=235, 128 English speakers and 107 Spanish speakers) provided additional demographic information regarding social factors such as marital status, number of children, and social interactions defined by number of people with whom one interacts within a 2-week period. When comparing this subset of individuals with the larger group, those who provided these additional variables were younger (mean = 42.9 years, SD =15.4; versus mean = 48.6, SD =18.6; P<0.001) and were more likely to be female (χ2[1, N=1,444]=4.6, P=0.03) with fewer years of education (mean = 12.3 years, SD =4.2; versus mean = 13.3, SD =3.4; P<0.001). When comparing English and Spanish speakers on the additional sociodemographic variables, groups did not differ on marital status; however, Spanish speakers had more children (P=0.01) and reported having fewer social interactions in a 2-week time period (P=0.002).

Toolbox Emotion Battery

The NIHTB-EB for adults is a computerized assessment of emotions with 17 scales and four theoretically driven subdomains, developed based on psychometric analyses and consistency with the NIH Toolbox purpose (Table 2).Citation5 The battery takes ~20–30 minutes to complete, and it is self-administered. Detailed descriptions of the NIHTB-EB scales are included in the NIH Toolbox Score and Interpretation Guide (www.nihtoolbox.org) and are summarized in Table 3 for the individual scales as well as for the final, confirmatory factor analyses (CFA)-based summary scores. Each item administered has a 5- or 7-point Likert scale with options ranging from “not at all” to “very much”. Each scale is scored using item response theory (IRT) methods, producing an IRT-generated theta score. In IRT, the assumption is that all individuals have some degree of the underlying trait and the amount of that trait determines the probability that they will answer an item in a specific way.Citation19 Additionally, the battery is computer adaptive to accurately and efficiently assess each latent construct. This means that the items that an individual participant receives are dependent on his/her prior responses and therefore highly individualized to sensitively capture his/her emotional functioning; due to this approach, not all participants complete the exact same set of individual items. Scores more than one standard deviation below the mean (T<40) suggest a low level of the trait, and scores more than one standard deviation above the mean (T>60) suggest a high level of the trait. All scales of the NIHTB-EB are freely available on the HealthMeasures.net website, and the correct page can be directly accessed using http://www.healthmeasures.net/explore-measurement-systems/nih-toolbox/obtain-and-administer-measures.
Under “Obtain and Administer Measures”, select “Download a zip file of all available NIH Toolbox Emotion PDFs” and then open the zip file, open the “English” file, and select “Self-Report 18+” to view all scales in the adult battery.

Table 2 NIH Emotion Battery scales and original theoretically identified subdomains

Table 3 Summary descriptions of the NIHTB-EB composites and component scales

Derivation of 2010 US census-weighted normalized T-scores

We determined that normative adjustments for age and other demographic variables were not theoretically desirable or statistically necessary for the NIHTB-EB scores. That is, emotion scores are most usefully interpreted as reflecting the absolute amount of the trait in an individual, not the relative amount of the trait compared to others of that individual’s age or gender; additionally, we found that demographics were minimally associated with the NIHTB-EB scores (ie, <5% variance was accounted for on each scale). However, in order to ensure that the normative sample was as representative of the general US population as possible, we weighted our sample to reflect the demographics of the 2010 US census for English speakers. To achieve this, we applied a raking procedureCitation20 using the Statistical Analysis System macro “raking” by Battaglia et al.Citation21 This method assigns each participant a weight, based on age, gender, education, and race/ethnicity, such that the weighted sample is demographically proportionate to US 2010 census data.
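The raking step can be illustrated with a minimal sketch of iterative proportional fitting. This is a Python stand-in for the general technique, not the SAS macro used in the study; the function names, the toy records, and the target proportions below are all hypothetical.

```python
def rake(records, targets, max_iter=100, tol=1e-8):
    """Iterative proportional fitting: adjust one weight per record until the
    weighted share of every category matches its target proportion."""
    weights = [1.0] * len(records)
    for _ in range(max_iter):
        max_shift = 0.0
        for var, shares in targets.items():
            total = sum(weights)
            # Current weighted total within each category of this variable.
            current = {cat: 0.0 for cat in shares}
            for rec, w in zip(records, weights):
                current[rec[var]] += w
            # Rescale every record so this variable's margins hit the target.
            for i, rec in enumerate(records):
                factor = shares[rec[var]] * total / current[rec[var]]
                weights[i] *= factor
                max_shift = max(max_shift, abs(factor - 1.0))
        if max_shift < tol:  # all margins already match: converged
            break
    return weights


def weighted_share(records, weights, var, cat):
    """Weighted proportion of the sample falling in one category."""
    total = sum(weights)
    return sum(w for r, w in zip(records, weights) if r[var] == cat) / total


# Hypothetical six-person sample and made-up population margins.
records = [
    {"gender": "F", "age": "young"}, {"gender": "F", "age": "young"},
    {"gender": "F", "age": "old"}, {"gender": "M", "age": "young"},
    {"gender": "M", "age": "old"}, {"gender": "M", "age": "old"},
]
targets = {"gender": {"F": 0.6, "M": 0.4}, "age": {"young": 0.3, "old": 0.7}}
weights = rake(records, targets)
```

Each pass matches one variable's margins at a time; repeating the passes lets all margins agree simultaneously, which is why raking needs only marginal census proportions rather than a full cross-tabulation.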

For individual scales, raw (theta) scores for each scale in the census-weighted sample were converted to sample-based normalized T-scores (mean = 50; SD = 10). Therefore, the normalized T-scores represent an individual’s emotional characteristics compared to the average English-speaking person in the USA.Citation18 For the Spanish-speaking cohort, raw (theta) scores were converted to sample-based T-scores without census-weighted corrections, given that there was no appropriate census data for this cohort. Therefore, normalized T-scores on the Spanish NIHTB-EB represent an individual’s affective characteristics compared to our large normative cohort of Spanish-speaking adults.Citation22
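As a rough illustration of the conversion, the sketch below assumes a linear transform of theta onto a metric with mean 50 and SD 10, using a weighted reference mean and SD. The function names and the toy values are hypothetical, and the population-style variance is an assumption; the study's own conversion formulas are those given in its supplementary tables.

```python
import math


def weighted_mean_sd(values, weights):
    """Weighted mean and (population-style) standard deviation."""
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(var)


def theta_to_t(theta, ref_mean, ref_sd):
    """Map a theta score onto a T metric (mean 50, SD 10)."""
    return 50.0 + 10.0 * (theta - ref_mean) / ref_sd


# Hypothetical theta scores with made-up survey weights.
thetas = [-1.0, 0.0, 1.0]
survey_weights = [1.0, 2.0, 1.0]
ref_mean, ref_sd = weighted_mean_sd(thetas, survey_weights)
t_at_mean = theta_to_t(ref_mean, ref_mean, ref_sd)
t_one_sd_above = theta_to_t(ref_mean + ref_sd, ref_mean, ref_sd)
```

For the Spanish-speaking cohort the same transform would apply with equal weights, since no census weighting was used there.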

NIHTB-EB factor analyses

In order to create summary scores that reflect the underlying latent structure of the NIHTB-EB, factor analyses were conducted using single sample cross-validation methodologies. Specifically, English- and Spanish-speaking samples were split into two samples within each group stratified on gender and age. For English speakers, one subsample (n=636) was used for exploratory factor analyses (EFA) and another subsample (n=400) was used for CFA. Similarly, for Spanish speakers one sub-sample (n=208) was used for EFA and the other subsample (n=200) for CFA. In this way, the latent constructs underlying the NIHTB-EB scales could be examined with EFA and validated with CFA in a separate sample. All factor analyses were performed on raw (theta) scores for English- and Spanish-speaking cohorts separately, using the R software and the “lavaan” package.Citation23

To identify underlying latent factors, EFA with maximum likelihood estimation was used to calculate eigenvalues and determine the number of factors to extract using multiple approaches. A multiple approach to data reduction, rather than use of single criteria (eg, scree test, eigenvalues >1, and cumulative percent of variance extracted), has been suggested to be the best practice in EFA research.Citation24 The eigenvalues were obtained from a principal components analysis. A conservative approach was initially taken for factor extraction. If fewer than the appropriate number of factors are initially extracted, the factors may include excessive errors due to important variables going unnoticed. The salient loading criterion adds 50% to what is suggested by eigenvalue criteria. Strength of the scale loadings on each factor was examined, and factors with a minimum of three scales loading >0.3 or two scales loading >0.5 were retained. Also, consistent with the salient loading criterion, scales whose highest loading did not exceed their loadings on the other factors by a margin of at least 0.13 were removed from the analysis until there were no cross-loadings within a 0.13 margin. An oblique promax rotation of the extracted factors was utilized to achieve the simplest structure. Inter-item correlations and Cronbach’s α were examined to calculate internal consistency estimates of reliability. Seventeen scales were entered into the EFA.
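The retention and cross-loading rules described above can be written as two small checks. This is only a sketch: the function names are hypothetical, and comparing loadings by absolute value is an assumption, since the text does not say how negative loadings were handled.

```python
def retain_factor(factor_loadings, moderate=0.3, strong=0.5):
    """Keep a factor if at least three scales load above 0.3
    or at least two scales load above 0.5."""
    magnitudes = [abs(x) for x in factor_loadings]
    n_moderate = sum(m > moderate for m in magnitudes)
    n_strong = sum(m > strong for m in magnitudes)
    return n_moderate >= 3 or n_strong >= 2


def is_cross_loaded(scale_loadings, margin=0.13):
    """Flag a scale whose top loading beats its runner-up by less than the margin."""
    ranked = sorted((abs(x) for x in scale_loadings), reverse=True)
    return len(ranked) > 1 and (ranked[0] - ranked[1]) < margin


# Hypothetical loadings, for illustration only.
keep_two_strong = retain_factor([0.62, 0.55, 0.10])   # two scales above 0.5
keep_one_weak = retain_factor([0.35, 0.20, 0.10])     # only one scale above 0.3
ambiguous_scale = is_cross_loaded([0.45, 0.40, 0.05])
clean_scale = is_cross_loaded([0.70, 0.20, 0.05])
```

`retain_factor` takes the loadings of all scales on one factor, whereas `is_cross_loaded` takes one scale's loadings across all factors.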

To validate the best-fitting models determined from a priori hypotheses and the EFA step, CFAs were performed. Specifically, the latent structure of the theoretically pre-existing subdomains (4-factor solution), a 1-factor (all scales), 2-factor (positive and negative scales), and the factor solution derived from the EFA step were examined with a CFA approach (refer to Table S1 for the specific scales within each factor solution). The distributions for each of the 17 scales were first examined for normality. CFA for each factor model was conducted using maximum likelihood estimation with robust (Huber-White) standard errors while also modeling correlation among factors. Use of the chi-square likelihood ratio test to assess model fit has been deemed unsatisfactory for numerous reasons.Citation25 Rather, many researchers have suggested the use of multiple measures of model fit.Citation26 Therefore, the following measures of model fit were used: 1) the comparative fit index (CFI),Citation27 which compares the target model to a baseline null model that specifies no factors (values >0.90 indicate adequate model fit and values >0.93 indicate good model fit); 2) the root mean square error of approximation (RMSEA),Citation28 which adjusts fit by weighting values by the number of parameters estimated (values <0.08 indicate adequate model fit while <0.05 indicate good model fit);Citation29 and 3) standardized root mean square residual (SRMR),Citation30 which is an absolute measure of fit defined as the standardized difference between the observed correlation and the predicted correlation (values <0.08 indicate good model fit). Using these indices, the best fitting and most parsimonious factor model was identified. To maximize model fit, we revised the best fitting model using the Wald test,Citation31 which identifies scales that if dropped would improve overall model fit, and proceeded to examination of the standardized factor loadings for each scale.
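The cutoffs quoted above can be collected into a small helper that labels each index. This is a sketch only: the function name, the "below cutoff" label, and the example values are hypothetical and not taken from the study's results.

```python
def rate_fit(cfi, rmsea, srmr):
    """Label each fit index using the cutoffs quoted in the text."""
    if cfi > 0.93:
        cfi_label = "good"
    elif cfi > 0.90:
        cfi_label = "adequate"
    else:
        cfi_label = "below cutoff"

    if rmsea < 0.05:
        rmsea_label = "good"
    elif rmsea < 0.08:
        rmsea_label = "adequate"
    else:
        rmsea_label = "below cutoff"

    # The text gives a single SRMR cutoff, so there is no "adequate" tier here.
    srmr_label = "good" if srmr < 0.08 else "below cutoff"
    return {"CFI": cfi_label, "RMSEA": rmsea_label, "SRMR": srmr_label}


# Hypothetical fit statistics, for illustration only.
example_fit = rate_fit(cfi=0.95, rmsea=0.06, srmr=0.04)
```

Reporting the three labels side by side, rather than collapsing them into one verdict, mirrors the text's preference for multiple measures of model fit.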

Summary score creation

We used the best fitting model from CFA to create summary scores in the full sample, which included all participants. The full sample was used in this step to provide the most precise estimates in our summary score equations (N=1,036 for English and N=408 for Spanish). Specifically, summary scores were created by weighting the raw (theta) score for each participant’s individual scale by the CFA standardized factor loadings and then averaging across scales within a latent domain. The weighted average scores were then normalized to a T-score distribution (mean 50 and standard deviation 10) similar to how individual normalized scales were created.
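The two steps described above (loading-weighted averaging, then rescaling to a T metric) might be sketched as follows. The function name, theta values, and loadings are hypothetical; the sketch also assumes reverse-scored scales have already been reversed and omits census weighting for brevity, so it does not reproduce the published summary score equations.

```python
import statistics


def summary_t_scores(theta_rows, loadings):
    """Weight each scale's theta by its factor loading, average within the
    domain, then rescale the composites to mean 50 and SD 10."""
    composites = [
        sum(t * l for t, l in zip(row, loadings)) / len(loadings)
        for row in theta_rows
    ]
    mean = statistics.fmean(composites)
    sd = statistics.pstdev(composites)
    return [50.0 + 10.0 * (c - mean) / sd for c in composites]


# Hypothetical theta scores (one row per participant, one column per scale)
# and made-up standardized loadings for a three-scale domain.
theta_rows = [
    [-0.5, 0.2, 0.1],
    [0.3, 0.4, -0.2],
    [1.0, 0.8, 0.6],
    [-1.2, -0.7, -0.9],
]
loadings = [0.8, 0.7, 0.6]
t_scores = summary_t_scores(theta_rows, loadings)
```

Weighting by loadings lets scales that are more central to a factor contribute more to its summary score than a plain average would.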

Potentially problematic emotion cut-point

We established cut-points of more than one standard deviation below the mean (T<40) for positive emotion scales and more than one standard deviation above the mean (T>60) for negative emotion scales to indicate a “potentially problematic” emotion across the summary scores (refer to Table 3 for each scale’s problematic direction).Citation32 Using the normal curve, we expect such a cut-point to demonstrate ~84% specificity (ie, ~16% potentially problematic emotion) among a general population of healthy individuals.
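The ~16% figure follows directly from the normal curve, as the short sketch below shows using Python's standard library. The function name and the `negative_scale` flag are hypothetical conveniences for encoding each scale's problematic direction.

```python
from statistics import NormalDist


def flag_problematic(t_score, negative_scale):
    """Apply the one-SD cut-point in the scale's problematic direction:
    T > 60 for negative emotion scales, T < 40 for positive ones."""
    return t_score > 60 if negative_scale else t_score < 40


# Share of a normal T distribution (mean 50, SD 10) lying beyond one SD
# in a single direction.
t_distribution = NormalDist(mu=50, sigma=10)
expected_rate = 1.0 - t_distribution.cdf(60)
expected_specificity = 1.0 - expected_rate
```

Under these assumptions `expected_rate` is about 0.159, which is the source of the ~16% base rate and ~84% specificity expected in a healthy population.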

To help control for Type I error due to large sample sizes and multiple analyses, a somewhat conservative α value of 0.01 was used to indicate significance for all analyses.

Results

Exploratory factor analyses

The EFA of the 17 scales for the stratified sample (n=636 English speakers, n=208 Spanish speakers) supported the same 3-factor solution for the English- and Spanish-speaking cohorts. Seven scales (fear affect, anger affect, sadness, perceived stress, anger hostility, fear somatic arousal, and anger physical aggression) loaded saliently on Factor 1 (negative affect). Five scales (friendship, emotional support, instrumental support, and reverse-scored loneliness and perceived rejection) loaded saliently on Factor 2 (social satisfaction). Three scales (meaning, life satisfaction, and positive affect) loaded saliently on Factor 3 (psychological well-being). Self-efficacy and perceived hostility did not load saliently for either cohort (Table 4).

Table 4 Oblique rotated factor loadings of exploratory factor analysis from split sample

For the English-speaking cohort, Factor 1 explained 23% of the variance (Cronbach’s α=0.86), Factor 2 explained 18% of the variance (Cronbach’s α=0.84), and Factor 3 explained 13% of the variance (Cronbach’s α=0.84). Together, the factor structure accounted for 54% of the total variance. For the Spanish-speaking cohort, Factor 1 explained 24% of the variance (Cronbach’s α=0.86), Factor 2 explained 17% of the variance (Cronbach’s α=0.85), and Factor 3 explained 15% of the variance (Cronbach’s α=0.82). Together, the factor structure accounted for 57% of the total variance.

Confirmatory factor analyses

As with EFA, the distributions of the 17 scales were adequate for CFA analyses. Table 5 reports the χ2 test statistics and model fit indices for all CFA models for English (n=400) and Spanish (n=200) administered scales. Anger physical aggression and fear somatic arousal were the lowest weighting scales on negative affect (loading ~0.40), and given that these scales did not improve fit indices, reduced parsimony, and were theoretically peripheral, they were excluded from the final CFA models.

Table 5 CFA model fit indices from split samples

The revised 3-factor model derived from the EFA step was the most parsimonious and best fitting model, as indicated by the CFI, RMSEA, and SRMR indices. Table 6 presents the standardized factor loadings for the best fitting 3-factor model, all of which were significant at P<0.001. Table 7 presents the correlation matrix among latent variables. For each language sample, negative affect was negatively associated with social satisfaction and psychological well-being. Also, social satisfaction and psychological well-being were positively associated with each other (Table 7).

Table 6 CFA model factor loadings for split sample

Table 7 CFA model latent variable correlations from split sample

To better understand if there were gender differences in our CFA model, we examined gender invariance for each language group separately. Among English speakers, there were no statistical differences between males (χ2=334.58, CFI =0.90, RMSEA =0.102, SRMR =0.061) and females (χ2=412.71, CFI =0.92, RMSEA =0.093, SRMR =0.054) from the results of a χ2 test comparing the models (P>0.05). Similarly, among Spanish speakers, there were no statistical differences between males (χ2=445.90, CFI =0.90, RMSEA =0.105, SRMR =0.063) and females (χ2=498.51, CFI =0.93, RMSEA =0.088, SRMR =0.050) from the results of a χ2 test comparing the models (P>0.05). However, among both language groups, our revised 3-factor CFA model fit was slightly better for females compared to males.

In an effort to remain consistent with the original work of NIHTB researchers, examination of the scales that make up each factor and the underlying construct led us to title Factor 1 as negative affect (NA). The scales in Factor 1 have the common theme of negative emotions (fear, anger, sadness, stress). Factor 2 is titled social satisfaction (SS), which included the common theme of the sense of support by others, connection to others, and how one feels others view him/her. Finally, Factor 3 is called psychological well-being (PWB) with scales that target positive emotions and the common theme of feeling content with aspects of self and life. Table 3 provides brief descriptions of emotions assessed by all individual NIHTB-EB scales and factor-based summary scores.

To test for measurement invariance between English- and Spanish-speaking samples in the 3-factor model, we examined increasingly restricted models between groups including 1) configural invariance (identical factor structures), 2) weak invariance (factor loadings are constrained to be equal), and 3) strong invariance (factor loadings and intercepts constrained to be equal). Comparisons of models revealed nonsignificant changes in χ2 comparing configural to weak invariance (Δχ2=9.59, df =10, P=0.477) and comparing strong to weak invariance (Δχ2=8.90, df =10, P=0.542). Additionally, there were no changes in CFI indices in these comparisons but there were small changes in RMSEA (ΔRMSEA =0.004 for both comparisons). Thus, these findings suggest that the 3-factor model is equivalent between English and Spanish groups.
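The reported P values can be checked from the Δχ2 statistics alone. The sketch below uses the closed-form upper-tail probability of a chi-square distribution, which is exact when the degrees of freedom are even (as with df = 10 here); the function name is hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with an
    even number of degrees of freedom:
    exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!"""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


# Delta chi-square values and df as reported in the text.
p_first_comparison = chi2_sf_even_df(9.59, 10)
p_second_comparison = chi2_sf_even_df(8.90, 10)
```

Both values agree with the reported P=0.477 and P=0.542 to three decimals, and both sit well above the study's α of 0.01, consistent with the conclusion that adding equality constraints did not significantly worsen fit.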

Finally, we applied the best fitting 3-factor model to the full sample in each language group in order to have the most precise estimates for the purpose of creating summary scores. Parameter estimates from these CFA models were used to generate the summary score equations (Table S2).

Conversion to T-scores

Based on the normative data, Tables S3 and S4 present mean and standard deviation of individual scales, along with formulas for conversion of raw scores to normalized T-scores, separately for English and Spanish versions of the NIHTB-EB. Values provided are census weighted for English speakers and sample weighted for Spanish speakers.

To determine whether demographic characteristics were significantly associated with results on the NIHTB-EB, the effect of age, education, gender, ethnicity, and household income was evaluated through individual regression analyses for each individual scale, separately for English and Spanish speakers. Significant effect sizes, measured in individual adjusted R-squared value, ranged from 0.005 to 0.048 for English speakers and 0.017 to 0.033 for Spanish speakers. Because the results indicated relatively small effect sizes for demographic variables and also because our goal was to provide scores for emotional functioning, which address the question of whether an individual is reporting high or low levels of the specific emotion, we elected not to recommend or provide demographic corrections for the Emotion Toolbox.

Although we are not correcting for age, gender, or education, to an extent we are accounting for the linguistic and associated cultural background influences that may be observed on test performances by providing separate normative formulas for those administered the battery in Spanish and English. Our group will provide details of demographic effects on NIHTB-EB scores, separately by linguistic groups, in a manuscript following this project.

Summary scores and base rates

Summary scores based on CFA results and factor weights for Spanish and English versions of the battery are provided in Table S2. To establish base rates for potentially problematic emotional functioning, emotional distress was defined by more than one standard deviation beyond the mean in the problematic direction for each scale and composite. Base rates for problematic emotions in the normative sample for the English-speaking cohort revealed 13.9% problematic emotions for negative affect, 16.8% for social satisfaction, and 15.2% for psychological well-being. Base rates of problematic emotions for the Spanish-speaking cohort revealed 18.4% with distress for negative affect, 18.2% for social satisfaction, and 13.0% for psychological well-being.

Social factors and summary scores

A majority of individuals in the combined language samples (n=1,083) provided information regarding total annual household income. Individuals with a household income ≥US$40,000 reported significantly more social satisfaction (mean = 50.89, SD =9.54; versus mean = 47.97, SD =10.72; P<0.001) and psychological well-being (mean = 50.36, SD =9.47; versus mean = 47.65, SD =10.47; P=0.0016), as well as slightly less negative affect (mean = 49.42, SD =9.32; versus mean = 51.21, SD =10.93; P=0.0169). There was an interaction between income and language for negative affect, P=0.030. English speakers who had an annual household income ≥$40,000 reported significantly less negative affect compared with those with less income (mean = 49.60, SD =9.34; versus mean = 53.51, SD =11.93; P<0.001). There was no effect of income on negative affect for Spanish speakers (income >$40,000, mean =48.49, SD =9.16; versus income <$40,000, mean =49.16, SD =9.57).

A much smaller subset of individuals (n=235) provided information on additional sociodemographic variables. Although the psychological well-being summary score was not computed for this subsample due to some missing data on the three scales that make up this factor, summary scores for negative affect and social satisfaction were computed and are available. Relevant to the representativeness of this subsample, their mean negative affect and social satisfaction scores were quite similar to the average results for the total sample (negative affect mean =49.83, SD =9.69; social satisfaction mean =50.22, SD =10.04). Results indicate that individuals who were married reported significantly less negative affect (mean = 48.46, SD =8.63; versus mean = 52.27, SD =10.96; P=0.004; d=0.39), as well as more social satisfaction (mean = 51.85, SD =9.28; versus mean = 47.32, SD =10.71; P<0.001; d=0.45) compared with those not married. There was a borderline interaction between marital status and language for social satisfaction, P=0.0456. For English speakers (n=127), being married was associated with greater social satisfaction (mean = 52.00, SD =9.43; versus mean = 45.12, SD =10.39; P<0.001; d=0.69). However, for Spanish speakers, this was not the case (married, mean =51.68, SD =9.17 versus not married, mean=50.45, SD =10.55; d=0.12). Having children was not significantly associated with the two summary scores. The number of individuals with whom one interacts within a 2-week time span also was not significantly associated with negative affect; however, individuals with greater numbers of social interactions reported significantly greater social satisfaction (F[1, 228]=16.24, P<0.001).
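The effect sizes above can be approximated from the reported means and SDs. The sketch below computes Cohen's d with a pooled SD that assumes equal group sizes; that is an assumption made here because the per-group counts are not given in this section, and the authors may have used an n-weighted pooled SD. The function name is hypothetical.

```python
import math


def cohens_d_equal_n(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d using a pooled SD that treats both groups as equal in size."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return abs(mean_a - mean_b) / pooled_sd


# Means and SDs as reported in the text (married versus not married).
d_negative_affect = cohens_d_equal_n(48.46, 8.63, 52.27, 10.96)
d_social_satisfaction = cohens_d_equal_n(51.85, 9.28, 47.32, 10.71)
d_social_english = cohens_d_equal_n(52.00, 9.43, 45.12, 10.39)
d_social_spanish = cohens_d_equal_n(51.68, 9.17, 50.45, 10.55)
```

Under this simplification the four values round to 0.39, 0.45, 0.69, and 0.12, matching the reported effect sizes.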

Discussion

The NIHTB-EB provides a computerized method of briefly assessing a broad spectrum of emotional functioning by including both positive and negative aspects of emotions. Domains were selected by experts, and item banks were created from the PROMIS battery, from well-established nonproprietary measures, and from new items written where prior measures could not be identified. In the end, 17 scales were developed as the core measures within the adult battery. In this study, we evaluated the domain structure of the NIHTB-EB for both English and Spanish speakers in a project aimed at creating summary scores, which has not been done previously. Here, we present census-weighted norms for the NIHTB-EB English speakers and sample-weighted norms for Spanish speakers. We have provided formulas that can be used to convert raw scores (theta scores provided by the NIHTB Assessment Center program) to standard T-scores for English and Spanish speakers separately, based on data from the normative samples. Demographically uncorrected scores are provided and, for English speakers, can be interpreted as reflecting an individual’s absolute level of that emotion compared to the average English-speaking US adult. These scores are also on a common metric, which may facilitate profile analyses and longitudinal comparisons. We identified three distinct constructs (negative affect, social satisfaction, and psychological well-being) and provided formulas using factor weights from CFA results for computing the summary scores. The final model and summary scores are applicable to both English- and Spanish-speaking adults. Given that we based our cut-points on the normal curve, which estimates that ~16% of the population will fall more than one standard deviation above the mean and ~16% more than one standard deviation below it, the base rates in our normative samples are commensurate with expectations for a normal distribution (refer to Table 3 for scale descriptions and domain specifications).
Base rates set in the normative sample with these summary scores can be applied to clinical samples to help differentiate problematic emotions across the identified domains.

In the absence of any gold standard assessments for validating cut-points in the current study, we tentatively use the term potentially problematic (not “abnormal”) to interpret scores beyond the one standard deviation point in the direction of distress. Of course, clinicians and researchers are free to set their own cut-points, especially as may be informed by future investigations of the NIHTB-EB. In this regard, however, we would advance the following considerations for NIHTB-EB users: in view of the fact that this battery aims to assess both positive and negative emotions and is intended for use with the general population as well as with clinical samples, it may be too restrictive to use cut-points that would classify almost all nonclinical (or undiagnosed) individuals in the general population as having nonproblematic emotional functioning.

The summary scores presented here can be used across research and clinical settings to aid in more efficient or parsimonious interpretation of findings from NIHTB-EB’s 17 scales. Although greater breadth of information is provided by consideration of all the individual scales, summary/composite scores integrate a significant amount of information into one score and may show greater reliability than the individual component scales. The data points or scales within each composite are now statistically and conceptually related based on the analyses we conducted, and the single score reduces the potential for “information overload”, making the battery more user-friendly. In many situations, a more efficient and user-friendly approach is consistent with the NIH Toolbox objective. Additionally, we have begun to validate these summary scores by demonstrating their association with social variables. For example, a greater number of social interactions is associated with an increased sense of social satisfaction on our social satisfaction summary score. The NIH Toolbox initiative has now incorporated these presented normative standards and computed summary scores into the NIHTB-EB iPad scoring program.

It is beyond the scope of the current project to fully address the validity of the NIHTB-EB, other than to report several relevant sociodemographic associations within the NIHTB national norming study. However, validation work with clinical samples is under way, and the findings and norms presented here are intended to be foundational for such efforts. Given that the battery was originally developed in putatively healthy individuals representative of the national census, clinical studies across more severe and diverse psychopathologies will importantly inform the criterion validity of the battery. One major strength of the NIHTB-EB is its comprehensive approach to mental health status, including measures across positive and negative affect and social functioning, which may increase its ability to capture and characterize even nuanced differences in psychological functioning across neuropsychiatric disease. Approximately 50 ongoing or completed studies (with >4,400 participants) are registered with the web-based NIHTB Assessment Center and include measures for the Emotion Battery summary scores, and some of these studies are beginning to report results with neurological samples (e.g., spinal cord injury, traumatic brain injury, and stroke).Citation33 The latter research found significant elevations in negative affect and lower levels of social satisfaction and psychological well-being in individuals with these neurological conditions compared to healthy adults, but also some differences across the neurological conditions. Additionally, the battery demonstrated sensitivity to improvement with treatment (transcranial magnetic stimulation) in a recent case study of traumatic brain injury.Citation34 Nonetheless, given its novelty, continued work to support the sensitivity of the NIHTB-EB to mental health disease is needed.
In addition, calculations for the current NIHTB-EB summary scores and norms have been programmed into a recent update of the NIHTB iPad app for use in ongoing and new studies.

There are several limitations in these newly developed normative standards. First, given that these normative data are based on the US population and that subtle cultural variations have been shown to impact how individuals report emotional health, generalizations cannot be made for international studies at this time. Also, we are not recommending demographic corrections for the Emotion Battery, given the relatively small demographic effect sizes and the interest in interpreting each score as the absolute level of that particular emotion compared to the average person residing in the USA. Interpreting scores of emotional functioning differs from interpretations used for cognitive functioning, which within the neuropsychological context aims to estimate the types and amounts of change in cognition that may have resulted from injury or disease affecting the central nervous system. Accurate classification of neuropsychological impairment, for example, is dependent on the normative comparison applied, such as the expected level of cognitive performance if the individual had a healthy brain and had never acquired any central nervous system compromise.Citation18 Although CNS dysfunction may affect emotional functioning as well, premorbid emotional status (as reflected in the Toolbox normative samples) is much less associated with demographics than is cognition. Nevertheless, we recognize that the current norming process did not take into account subtle effects of demographic variables. We did observe a trend for older individuals to report fewer negative emotions, for example. These trends can be further explored within specific populations to better understand their stability and significance.
Furthermore, in creating summary scores, the RMSEA fit indexes for our final CFA models were not <0.05, which has been suggested as a cutoff for good model fit.Citation29 However, our other fit indices (ie, CFI and SRMR) suggested that our final models adequately fit the data and produced valid summary scores of emotion in our sample. Nonetheless, future research creating more complex factorial models may yield a more accurate understanding of the underlying latent structure of the NIH Toolbox Emotion Battery.
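
For readers less familiar with the RMSEA, the sketch below computes its usual point estimate from a model chi-square, the degrees of freedom, and the sample size, and compares it with the 0.05 cutoff mentioned above. The function name `rmsea` and the chi-square values are hypothetical and are not results from this study. Some software divides by N rather than N - 1, so values may differ slightly from program output.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Point estimate of RMSEA: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


GOOD_FIT_CUTOFF = 0.05  # cutoff for good fit referred to in the text

# Hypothetical model results (not taken from the study)
for chi2, df, n in [(300.0, 100, 1001), (900.0, 100, 1001)]:
    value = rmsea(chi2, df, n)
    print(f"chi2={chi2}, df={df}, n={n}: RMSEA={value:.3f}, "
          f"below cutoff: {value < GOOD_FIT_CUTOFF}")
```

Because the chi-square statistic grows with sample size and model misfit, a model can exceed the 0.05 cutoff on RMSEA while other indices such as CFI and SRMR remain in an acceptable range.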

In addition, although we were able to separate English and Spanish speakers and provide normative data for each cohort, in the Spanish-speaking cohort, there is variability that could be important to emotional functioning that was not accounted for by the norming project. Information such as country of origin and years since immigration to the USA within the Spanish-speaking cohort was not accounted for. With a larger sample of Spanish speakers and a more comprehensive data collection process that includes items specific to diversity, these factors could be further explored. Also, other potentially important background factors were not consistently assessed in the normative study. Variables specific to social support, such as socioeconomic and marital status, were not systematically assessed in the norming project. For example, marital status was available for only ~17% of the sample and was found to be the largest contributor to the emotion scales at the group level (married individuals, as a group, tended to evidence somewhat better emotional health). We plan to report details of (relatively modest) associations with demographic factors in a future report.

Also moving forward, application of these normative standards and summary scores with the NIHTB-EB among various clinical populations is warranted to provide validation of the factor structures. A major limitation to this study is the lack of concurrent or discriminant validity for the newly created summary scores. Within the normative sample, we did not have data available to compare the current summary scores with other more established emotional/psychological measures. We are in the process of assessing and reporting effects of various neurological and psychiatric conditions on the NIHTB-EB, in some cases in relation to the other Toolbox domain instruments (cognition, motor, sensory), and in some cases in relation to other emotion assessments and standardized assessments of current and lifetime histories of various psychiatric conditions (major depressive disorder, substance use disorders, ADHD, and ASPD). However, these projects will have smaller samples and different goals; therefore, they are not within the scope of this study.

Furthermore, research with clinical samples should consider profiles of the NIHTB-EB scores both across and within composite categories. For example, would individuals diagnosed with major depressive disorder (MDD) tend to score in the problematic direction on all three summary scores and, within the negative affect category, would sadness typically be identified as the most problematic? For individuals who are successfully treated for MDD, what patterns of changes will be observed on the NIHTB-EB? Answering such questions will help establish the construct validity of the NIHTB-EB and the newly constructed scales. Solidified construct validity of the measure will increase its utility in clinical settings. This is particularly important given that there are not many methods of assessment for emotions that have a similarly broad focus. Summary scores based on the normative samples appear to be psychometrically sound and should be applied to clinical samples to validate the factor structures as well as to investigate rates of problematic emotions in medical and psychiatric populations.

Acknowledgments

This study was supported by a cooperative agreement from the National Institutes of Health to Northwestern University (U2CCA186878; PI: David Cella, PhD). These contents do not necessarily represent an endorsement by the US Federal Government (refer to www.healthmeasures.net for additional information). Funding for HealthMeasures was provided by the National Institutes of Health grant U2C CA186878. We wish to thank Michael Thomas, PhD, for his invaluable consultation on statistical methodologies used in this manuscript.

Supplementary materials

Table S1 Emotion Battery scales in factor solutions examined for best model fit

Table S2 Summary score formulas

Table S3 English-speaking raw scores conversion to standard scores

Table S4 Spanish-speaking raw scores conversion to standard scores

Disclosure

The authors report no conflicts of interest in this work.

References

  • Gershon RC, Rothrock N, Hanrahan R, Bass M, Cella D. The use of PROMIS and assessment center to deliver patient-reported outcome measures in clinical research. J Appl Meas. 2010;11(3):304-314. PMID: 20847477.
  • Nowinski CJ, Victorson D, Debb SM, Gershon RC. Input on NIH toolbox inclusion criteria: surveying the end-user community. Neurology. 2013;80(11 suppl 3):S7-S12. PMID: 23479548.
  • Rothrock NE, Hays RD, Spritzer K, Yount SE, Riley W, Cella D. Relative to the general US population, chronic diseases are associated with poorer health-related quality of life as measured by the Patient-Reported Outcomes Measurement Information System (PROMIS). J Clin Epidemiol. 2010;63(11):1195-1204. PMID: 20688471.
  • Revicki DA, Cook KF, Amtmann D, Harnam N, Chen W-H, Keefe FJ. Exploratory and confirmatory factor analysis of the PROMIS pain quality item bank. Qual Life Res. 2014;23(1):245-255. PMID: 23836435.
  • Salsman JM, Butt Z, Pilkonis PA. Emotion assessment using the NIH toolbox. Neurology. 2013;80(11 suppl 3):S76-S86. PMID: 23479549.
  • Pilkonis PA, Choi SW, Salsman JM. Assessment of self-reported negative affect in the NIH toolbox. Psychiatry Res. 2013;206(1):88-97. PMID: 23083918.
  • Watson D, Tellegen A. Toward a consensual structure of mood. Psychol Bull. 1985;98(2):219-235. PMID: 3901060.
  • Diener E, Suh EM, Lucas RE, Smith HL. Subjective well-being: three decades of progress. Psychol Bull. 1999;125(2):276-302.
  • Neubauer AB, Voss A. Validation and revision of a German version of the balanced measure of psychological needs scale. J Individ Differ. 2016;37(1):56-72.
  • Ryff CD. Happiness is everything, or is it? Explorations on the meaning of psychological well-being. J Pers Soc Psychol. 1989;57(6):1069-1081.
  • Sheldon KM, Hilpert JC. The balanced measure of psychological needs (BMPN) scale: an alternative domain general measure of need satisfaction. Motiv Emot. 2012;36(4):439-451.
  • Baumeister RF, Leary MR. The need to belong: desire for interpersonal attachments as a fundamental human motivation. Psychol Bull. 1995;117(3):497-529. PMID: 7777651.
  • Deci EL, Ryan RM. The “what” and “why” of goal pursuits: human needs and the self-determination of behavior. Psychol Inq. 2000;11(4):227-268.
  • Goldman N, Glei DA, Seplaki C, Liu I-W, Weinstein M. Perceived stress and physiological dysregulation in older adults. Stress. 2005;8(2):95-105. PMID: 16019601.
  • Franz CE, O’Brien RC, Hauger RL. Cross-sectional and 35-year longitudinal assessment of salivary cortisol and cognitive functioning: the Vietnam Era Twin Study of Aging. Psychoneuroendocrinology. 2011;36(7):1040-1052. PMID: 21295410.
  • Salsman JM, Lai JS, Hendrie HC. Assessing psychological well-being: self-report instruments for the NIH toolbox. Qual Life Res. 2014;23(1):205-215. PMID: 23771709.
  • Beaumont JL, Havlik R, Cook KF. Norming plans for the NIH toolbox. Neurology. 2013;80(11 suppl 3):S87-S92. PMID: 23479550.
  • Casaletto KB, Umlauf A, Beaumont J. Demographically corrected normative standards for the English version of the NIH toolbox cognition battery. J Int Neuropsychol Soc. 2015;21(5):378-391. PMID: 26030001.
  • Reise S, Embretson S. Item Response Theory for Psychologists. Mahwah, New Jersey: Lawrence Erlbaum Associates; 2000.
  • Deming WE, Stephan FF. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann Math Stat. 1940;11(4):427-444.
  • Battaglia MP, Izrael D, Hoaglin DC, Frankel MR. Practical considerations in raking survey data. Surv Pract. 2009;2(5):1-37.
  • Casaletto KB, Umlauf A, Marquine M. Demographically corrected normative standards for the Spanish language version of the NIH toolbox cognition battery. J Int Neuropsychol Soc. 2016;22(3):364-374. PMID: 26817924.
  • Rosseel Y. Lavaan: an R package for structural equation modeling. J Stat Softw. 2012;48(2):1-36.
  • Osborne JW, Costello AB. Best practices in exploratory factor analysis: four recommendations for getting the most from your analysis. Pract Assess Res Eval. 2005;10(7):1-9.
  • Tanaka J. Multifaceted concepts of fit in structural equation models. In: Bollen K, Long S, editors. Testing Structural Equation Models. Newbury Park, CA: Sage; 1993:10-39.
  • Hoyle RH, Panter AT. Writing about structural equation models. In: Hoyle RH, editor. Structural Equation Modeling: Concepts, Issues, and Applications. Thousand Oaks, CA: Sage; 1995:158-176.
  • Bentler PM. Comparative fit indexes in structural models. Psychol Bull. 1990;107(2):238-246. PMID: 2320703.
  • Steiger JH. Structural model evaluation and modification: an interval estimation approach. Multivariate Behav Res. 1990;25(2):173-180. PMID: 26794479.
  • MacCallum RC, Browne MW, Sugawara HM. Power analysis and determination of sample size for covariance structure modeling. Psychol Methods. 1996;1(2):130-149.
  • Hu L, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct Equ Model A Multidiscip J. 1999;6(1):1-55.
  • Brown T. Confirmatory Factor Analysis for Applied Research. 2nd ed. New York City, NY: Guilford Press; 2015.
  • Taylor M, Heaton R. Sensitivity and specificity of WAIS-III/WMS-III demographically corrected factor scores in neuropsychological assessment. J Int Neuropsychol Soc. 2001;7(7):867-874. PMID: 11771630.
  • Carlozzi NE, Goodnight S, Casaletto KB. Validation of the NIH toolbox in individuals with neurologic disorders. Arch Clin Neuropsychol. 2017;32(5):555-573. PMID: 28334392.
  • Siddiqi SH, Trapp NT, Hacker CD. rTMS with individualized resting-state network mapping for neuropsychiatric sequelae of repetitive traumatic brain injury in a retired NFL player. bioRxiv. 2017. Available from: https://www.biorxiv.org/content/early/2017/11/21/151696. Accessed February 19, 2018.
  • Health Measures. Available from: http://www.healthmeasures.net/explore-measurement-systems/nih-toolbox/obtain-and-administer-measures. Accessed February 19, 2018.