Original Research

Calibration and validation of an item bank for measuring general physical function of patients in medical rehabilitation settings

Pages 11-16 | Published online: 28 Dec 2017

Abstract

Objective

The objective of this study was to report the item response theory (IRT) calibration of an 18-item bank to measure general physical function (GPF) in a wide range of conditions and evaluate the validity of the derived scores.

Methods

All 18 items were administered to a large sample of patients (n=2337) who responded to the items in the context of their outpatient rehabilitation care. The responses, collected 1997–2000, were modeled using the graded response model, an IRT model appropriate for items with two or more ordered response options. Inter-item consistency was evaluated based on Cronbach’s alpha and item to total correlations. Validity of scores was evaluated based on known-groups comparisons (age, number of health problems, symptom severity). The strength of a single, general factor was evaluated using a bi-factor model. Results were used to evaluate the IRT assumption of unidimensionality and as an indicator of construct validity. Local independence of item responses was also evaluated.

Results

Response data met the assumptions of unidimensionality and local independence. Explained common variance of a single general factor was 0.88 (omega hierarchical =0.86). Only two of the 153 pairs of item residuals were flagged for local dependence. Inter-item consistency was high (0.93) as were item to total correlations (mean =0.61). Substantial variation was found in both IRT location (difficulty) and discrimination parameters. All omnibus known-groups comparisons were statistically significant (p<0.001).

Conclusion

Item responses fit the IRT unidimensionality assumptions and were internally consistent. The usefulness of GPF scores in discriminating among patients with different levels of physical function was confirmed. Future studies should evaluate the validity of GPF scores based on an adaptive administration of items.

Introduction

The Institute of Medicine has advocated,Citation1 and a number of legislative efforts have supported,Citation2–Citation4 incentivizing performance instead of volume for the US health care delivery system. The envisioned future of a responsive, effective, and efficient health care delivery system that incentivizes performance requires the existence of psychometrically sound patient-reported outcomes measures (PROMs). Increasingly, PROMs are being administered using a tailored approach, known as computer adaptive testing (CAT).Citation5,Citation6 CAT has been developed for use in health outcomes,Citation7,Citation8 rehabilitation,Citation9,Citation10 and clinical applications.Citation11,Citation12 Adaptive item administration is attractive because it reduces respondent burden with little erosion of measurement precision.Citation13,Citation14

Focus On Therapeutic Outcomes, Inc. (FOTO) is an international measurement system that has provided data collection and reporting of medical rehabilitation outcomes since 1994.Citation15,Citation16 In 2001, FOTO began administering PROMs using CAT. The use of CAT requires the development of a bank of items that measure the targeted outcome and that have been calibrated using an item response theory (IRT) model.Citation17 Most item banks developed by FOTO have targeted specific body parts.Citation18–Citation23 The purpose of this paper is to report on the calibration and evaluation of an item bank that is domain- rather than body-part-specific – the general physical function (GPF) scale.

Methods

Participants

Study data were drawn from a convenience sample of 2337 adult patients who were treated in clinical facilities participating with FOTO. These participants responded to all 18 items of the GPF item bank and to demographic and clinical questions. Data were collected from 1997 to 2000 in 20 different states in the USA. The study was ruled exempt from human subjects review by the institutional review board of Northwestern University, Chicago, IL, because the research involved the study of existing data that were recorded by the investigator in such a manner that participants cannot be identified.

Instrumentation

GPF item bank

The GPF item bank includes 18 items originally developed to measure functional status. Eleven of the items were adapted from the RAND 36-Item Short Form Health Survey.Citation24 The remaining seven items were developed by FOTO clinician scientists to extend the effective measurement range of the measure. These items targeted lower levels of physical functioning to ensure good discrimination at the “floor” of the measure.

Demographics and clinical characteristics

In addition to responses to GPF items, patients reported their sex, age, impairment category, comorbidity, and symptom acuity (“0” = Asymptomatic, no treatment needed at this time; “1” = Symptoms well controlled with current therapy; “2” = Symptoms controlled with difficulty, needs ongoing monitoring and affects daily functioning; “3” = Symptoms poorly controlled, needs frequent adjustment in treatment monitoring; and “4” = Symptoms poorly controlled, history of re-hospitalization).

Analyses

Item analyses, calibration, and scoring

Tests of IRT assumptions

Samejima’s logistic graded response model (GRM)Citation31 was used to calibrate item responses. Like most IRT models, the GRM assumes response data are unidimensional and locally independent.Citation17,Citation25 Typically, the unidimensionality assumption is tested based on a confirmatory factor analysis that posits a single factor model and then evaluates the fit of that model based on standard fit criteria. Newer approaches fit a bifactor model to allow a more direct evaluation of the relevant statistical question of whether item responses are unidimensional enough to warrant calibration using a unidimensional IRT model.Citation26 The bifactor model posits that all items load on a single general factor and that each subset of items also loads on its own, distinct group factor. From such a model, the proportions of total variance (omega hierarchical) and common variance (explained common variance) accounted for by the general factor are estimated. To obtain these values, we fit a bifactor model using the psych package in R.Citation27 Reise et al recommended a “tentative” minimum criterion for omega hierarchical of greater than 0.50 (with >0.75 being preferred)Citation26 and for explained common variance of ≥0.60.Citation28
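As an illustration of how these two indices relate to bifactor loadings, the following is a minimal sketch in Python, not the psych package routine used in the study. It assumes standardized loadings, orthogonal factors, and unit item variances; the function name and the six-item loading values are hypothetical.

```python
def bifactor_indices(general, groups):
    """Return (ECV, omega hierarchical) from standardized bifactor loadings.

    general: general-factor loading for each item.
    groups:  one list per group factor; groups[k][i] is item i's loading on
             group factor k (0.0 if the item is not in that group).
    Assumes orthogonal factors and items with unit variance.
    """
    n_items = len(general)
    gen_sq = sum(l * l for l in general)
    grp_sq = sum(l * l for g in groups for l in g)
    # Explained common variance: share of common variance due to the general factor.
    ecv = gen_sq / (gen_sq + grp_sq)

    # Item uniqueness: variance not explained by any factor.
    uniq = [
        1.0 - general[i] ** 2 - sum(g[i] ** 2 for g in groups)
        for i in range(n_items)
    ]
    gen_var = sum(general) ** 2
    grp_var = sum(sum(g) ** 2 for g in groups)
    # Omega hierarchical: share of total score variance due to the general factor.
    omega_h = gen_var / (gen_var + grp_var + sum(uniq))
    return ecv, omega_h


# Hypothetical loadings for six items and two group factors (not study data).
general = [0.7, 0.7, 0.6, 0.8, 0.7, 0.6]
groups = [
    [0.3, 0.3, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.2, 0.3, 0.3],
]
ecv, omega_h = bifactor_indices(general, groups)
meets_criteria = ecv >= 0.60 and omega_h > 0.75
```

With these made-up loadings both indices clear the criteria cited above; real values depend entirely on the fitted model.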

Local independence was evaluated by extracting the residuals remaining after responses were fit to a unidimensional confirmatory factor model using MPlus.Citation29 IRT models assume that these residuals are not correlated. Standards for evaluating local independence vary. Reeve et al recommended flagging and considering the deletion of items whose residuals correlate >0.20 with residuals of other items.Citation30
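The flagging rule can be sketched as follows. This is an illustrative Python fragment, not the MPlus procedure; the function name and the small residual correlation matrix are hypothetical, and the rule is applied as stated (correlations greater than 0.20, without taking absolute values).

```python
from itertools import combinations
from math import comb


def flag_local_dependence(resid_corr, threshold=0.20):
    """Return (i, j, r) for item pairs whose residual correlation exceeds threshold."""
    n = len(resid_corr)
    return [
        (i, j, resid_corr[i][j])
        for i, j in combinations(range(n), 2)
        if resid_corr[i][j] > threshold
    ]


# Number of distinct item pairs in an 18-item bank.
n_pairs_18 = comb(18, 2)

# Hypothetical residual correlations for four items (not study data).
resid_corr = [
    [1.00, 0.05, 0.29, -0.02],
    [0.05, 1.00, 0.10, 0.08],
    [0.29, 0.10, 1.00, 0.12],
    [-0.02, 0.08, 0.12, 1.00],
]
flagged = flag_local_dependence(resid_corr)
```

Here only the pair of items 0 and 2 is flagged. An 18-item bank yields 18 × 17 / 2 = 153 pairs to check.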

Item level analyses

To estimate inter-item consistency, we calculated Cronbach’s alpha. We also estimated the correlation between each item score and the total score on the remaining items. For Cronbach’s alpha, a range of 0.70 to 0.80 has been recommended as a standard for group level measurement.
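Both statistics can be computed directly from a respondent-by-item score matrix. The sketch below uses only the Python standard library; the function names and the five-respondent, three-item data set are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def item_rest_correlations(rows):
    """Correlate each item with the total score on the remaining items."""
    k = len(rows[0])
    result = []
    for j in range(k):
        item = [r[j] for r in rows]
        rest = [sum(r) - r[j] for r in rows]  # total excluding item j
        result.append(pearson(item, rest))
    return result


# Hypothetical responses: five respondents, three items (not study data).
rows = [[1, 2, 2], [2, 2, 3], [3, 3, 3], [4, 4, 5], [5, 4, 5]]
alpha = cronbach_alpha(rows)
item_rest = item_rest_correlations(rows)
```

Excluding the item from the total avoids inflating the correlation with the item's own variance.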

IRT calibration and scoring

Responses to the 18 GPF items were calibrated to the GRMCitation31 using Parscale software.Citation32 The GRM is appropriate for items with ordered polytomous responses, which is the format of the GPF items. The GRM allows item discrimination parameters (a) to vary, which is common for functional status items.Citation33,Citation34 After the GRM was fit, a linear transformation was performed so that GPF scores ranged from 0 to 100.
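To make the model concrete, the following sketch computes GRM category probabilities for one item and applies a linear rescaling to a 0 to 100 metric. It is not the Parscale implementation. The item parameters are hypothetical, the logistic function is used without a scaling constant, and the theta range of −4 to 4 used for rescaling is an assumption, since the source does not report the transformation constants.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Graded response model: probability of each ordered response category.

    theta:      latent trait level.
    a:          item discrimination (slope).
    thresholds: increasing category thresholds b_1 < ... < b_{m-1}
                for an item with m response categories.
    """
    def p_at_or_above(b):
        # Probability of responding in or above the category bounded by b.
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    cumulative = [1.0] + [p_at_or_above(b) for b in thresholds] + [0.0]
    # Category probability = difference between adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


def rescale(theta, lo=-4.0, hi=4.0):
    """Linearly map theta from an assumed range [lo, hi] onto 0 to 100."""
    return 100.0 * (theta - lo) / (hi - lo)


# Hypothetical four-category item (not a calibrated GPF item).
probs = grm_category_probs(theta=0.5, a=1.5, thresholds=[-1.0, 0.0, 1.2])
score_mid = rescale(0.0)
score_low = rescale(-4.0)
```

Because each item has its own slope a, items with larger a separate adjacent trait levels more sharply, which is the variation in discrimination that the GRM accommodates.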

Construct validation

Known-groups construct validity

We hypothesized that lower GPF scores would be observed for those who were older, reported greater symptom severity, and had a higher number of health conditions. Participant ages were grouped into the ranges 18–44, 45–65, and >65. The five symptom severity categories were collapsed into four comparison groups: because few participants endorsed the most severe category (“4”), categories “3” and “4”, both of which include the descriptor “poorly controlled”, were combined into a single group. Comorbidity groups were those with none, one, two, three, and greater than three comorbidities.
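The grouping rules can be written as simple recoding functions. This is a sketch with hypothetical function names and labels, assuming integer ages, acuity codes 0 to 4, and a non-negative comorbidity count.

```python
def age_group(age):
    """Assign an adult age in years to one of the three comparison ranges."""
    if age < 18:
        raise ValueError("sample is restricted to adults")
    if age <= 44:
        return "18-44"
    if age <= 65:
        return "45-65"
    return ">65"


def severity_group(code):
    """Collapse acuity codes 0-4 into four groups by merging codes 3 and 4."""
    if code not in range(5):
        raise ValueError("acuity code must be 0, 1, 2, 3, or 4")
    return min(code, 3)


def comorbidity_group(count):
    """Group comorbidity counts as none, one, two, three, or more than three."""
    if count < 0:
        raise ValueError("count must be non-negative")
    return str(count) if count <= 3 else ">3"


groups_example = (age_group(61), severity_group(4), comorbidity_group(5))
```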

Known-groups hypotheses were tested first at the omnibus level (ie, whether groups differed significantly overall) using analysis of variance (ANOVA). Pairwise comparisons between levels were made using Dunnett’s T3 post hoc test.Citation35
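The omnibus test compares between-group to within-group variability. The sketch below computes the one-way ANOVA F statistic and its degrees of freedom in plain Python; the function name and the three small groups of scores are hypothetical. It does not compute the p value, which requires the F distribution, and it does not implement the post hoc procedure.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variability of group means around the grand mean, weighted by group size.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variability of individual scores around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical 0-100 scores for three age groups (not study data).
scores_by_group = [[60, 65, 70], [50, 55, 60], [40, 45, 50]]
f_stat, df_between, df_within = one_way_anova_f(scores_by_group)
```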

Unidimensionality

The evaluation of unidimensionality described previously served dual purposes. Unidimensionality is an assumption of the IRT model used to calibrate the item responses. A finding of unidimensionality also supports the construct validity of the measure in that it indicates that, as hypothesized, GPF is a single construct.

Results

Table 1 summarizes the demographic and clinical characteristics of the sample. The majority of respondents were female (63.8%). Mean age in years was 61 (SD =18.3; range 18 to 99); 79.0% were 45 or older. The most common impairment category was stroke (22.4%) followed by orthopedic conditions (18.6%) and pain syndrome (14.4%). Just over half of the sample had experienced symptoms for more than 90 days (50.4%).

Table 1 Sample characteristics

Item analyses, calibration, and scoring

Tests of IRT assumptions

Based on a bi-factor model of responses to the 18 GPF items, we obtained an omega hierarchical value of 0.86 and an explained common variance of 0.88. These values are substantially higher than Reise et al’s suggested criteria for omega hierarchical (ie, >0.75 preferred)Citation26 and explained common variance (ie, ≥0.60),Citation28 supporting the unidimensionality of the item responses.

Assessment of local independence resulted in 153 possible paired comparisons between item residuals. Of these, only two had correlations >0.20. The residuals of the items, “How much does your health limit vigorous activities like running, lifting heavy objects, sports?” and “How much does your health limit participating in recreation?” had a correlation of 0.29. The residuals of the items, “How much does your health limit going on vacation?” and “How much does your health limit attending social events?” had a correlation of 0.26.

Item analyses

Cronbach’s alpha for the GPF item responses was very high (0.93). This result indicated very high inter-item consistency. The mean item score to total score correlation was 0.61. Correlation values ranged from 0.34 for the two-response item (“Do you limit the kind of work or other regular daily activities as a result of your physical health?”) to 0.74 (two items: “How much does your health limit climbing one flight of stairs/walking several blocks?”).

IRT calibration and scoring

Table 2 presents the item parameter estimates obtained in the GRM calibration of the GPF items. Items varied in discrimination (a; slope), confirming the need for a two-parameter IRT model that accounts for both item location and item discrimination (in one-parameter models, slopes are equal across items). The average location (ie, difficulty) of items on the logit metric ranged from −0.68 (“How much does your health limit completing your toileting?”) to 2.24 (“How much does your health limit vigorous activities like running, lifting heavy objects, sports?”).

Table 2 Item parameters for the general physical function scale

Construct validation

All omnibus known-groups comparisons were statistically significant (p<0.001) (Table 3). All but one of the pair-wise post hoc group comparisons were significant at this level: those with two comorbidities did not have scores that were significantly greater than those with three or more (p=0.144). The results related to unidimensionality supported that functional status was a single construct when measured in patients in this context.

Table 3 Known-groups validity results

Limitation

A limitation of this study is that the items were presented to respondents as a full bank, which is convenient for item calibration and evaluation, but is different from administering using CAT. Future studies should evaluate the validity of GPF scores based on an adaptive administration of items.

Conclusion

We examined an item bank with the purpose of assessing GPF of patients receiving care in a rehabilitation setting. Based on the factor analytic results, we concluded that a dominant general factor drove responses to items in this large and medically diverse sample, supporting the unidimensionality of the scale. The assumption of local independence was largely upheld. Inter-item consistency was very high (0.93) and, if the GPF items were intended as a single, 18-item measure, would warrant concerns about redundancy. However, the items were developed as an item bank for CAT administration. Because Cronbach’s alpha values are a function of the number of items in the scale as well as the covariances between item pair responses and the variance in total score, values are typically high in item banks, where the number of items tends to be large. The usefulness of GPF scores in discriminating among patients with different levels of functional status was confirmed by the results of the known-groups analyses. The GPF scores effectively distinguished groups expected to have different score levels.
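The dependence of alpha on the number of items can be illustrated with the standardized form of the coefficient, which expresses alpha in terms of the number of items and the mean inter-item correlation. This is a simplifying sketch that assumes equal item variances; the mean correlation of 0.40 is hypothetical and is not an estimate from the study data.

```python
def standardized_alpha(k, mean_r):
    """Standardized Cronbach's alpha for k items with mean inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# Hold the mean inter-item correlation fixed and vary only the number of items.
mean_r = 0.40  # hypothetical value
alphas = {k: standardized_alpha(k, mean_r) for k in (5, 10, 18)}
```

With the correlation held constant, alpha rises from about 0.77 for 5 items to about 0.92 for 18 items, so a high alpha in a large bank does not by itself indicate redundant items.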

Disclosure

The authors report no conflicts of interest in this work.

References

  1. Institute of Medicine. Rewarding Provider Performance: Aligning Incentives in Medicare. Washington, DC: National Academies Press; 2006.
  2. Grassley C. Medicare Value Purchasing Act of 2005. US Senate; S.1356; 2005. Available from: https://www.congress.gov/bill/109th-congress/senate-bill/1356. Accessed December 1, 2017.
  3. Johnson N. Medicare Value-Based Purchasing for Physicians’ Services Act of 2005. U.S. House of Representatives; H.R.3617; 2005. Available from: https://www.congress.gov/bill/109th-congress/house-bill/3617. Accessed December 1, 2017.
  4. Wilson N. Medicare Outpatient Therapy Value-Based Purchasing Act of 2006. U.S. House of Representatives; H.R.6048; 2006. Available from: https://www.congress.gov/bill/109th-congress/house-bill/6048. Accessed December 1, 2017.
  5. Hart DL, Deutscher D, Werneke MW, Holder J, Wang YC. Implementing computerized adaptive tests in routine clinical practice: experience implementing CATs. J Appl Meas. 2010;11(3):288–303. PMID: 20847476.
  6. Jette AM, Haley SM. Contemporary measurement techniques for rehabilitation outcomes assessment. J Rehabil Med. 2005;37(6):339–345. PMID: 16287664.
  7. Kisala PA, Tulsky DS, Pace N, Victorson D, Choi SW, Heinemann AW. Measuring stigma after spinal cord injury: development and psychometric characteristics of the SCI-QOL Stigma item bank and short form. J Spinal Cord Med. 2015;38(3):386–396. PMID: 26010973.
  8. Sung VW, Griffith JW, Rogers RG, Raker CA, Clark MA. Item bank development, calibration and validation for patient-reported outcomes in female urinary incontinence. Qual Life Res. 2016;25(7):1645–1654. PMID: 26732514.
  9. Amtmann D, Cook KF, Johnson KL, Cella D. The PROMIS initiative: involvement of rehabilitation stakeholders in development and examples of applications in rehabilitation research. Arch Phys Med Rehabil. 2011;92(10 Suppl):S12–S19. PMID: 21958918.
  10. Jette AM, Haley SM, Tao W, Ni P, Moed R, Meyers D, Zurek M. Prospective evaluation of the AM-PAC-CAT in outpatient rehabilitation settings. Phys Ther. 2007;87(4):385–398. PMID: 17311888.
  11. Cook KF, Buckenmaier C 3rd, Gershon RC. PASTOR/PROMIS (R) pain outcomes system: what does it mean to pain specialists? Pain Manag. 2014;4(4):277–283. PMID: 25300385.
  12. Wagner LI, Schink J, Bass M. Bringing PROMIS to practice: brief and precise symptom screening in ambulatory cancer care. Cancer. 2015;121(6):927–934. PMID: 25376427.
  13. Chien TW, Lin WS. Improving inpatient surveys: web-based computer adaptive testing accessed via mobile phone QR codes. JMIR Med Inform. 2016;4(1):e8. PMID: 26935793.
  14. Gamper EM, Petersen MA, Aaronson N. Development of an item bank for the EORTC Role Functioning Computer Adaptive Test (EORTC RF-CAT). Health Qual Life Outcomes. 2016;14:72.
  15. Swinkels IC, Hart DL, Deutscher D, van den Bosch WJ, Dekker J, de Bakker DH, van den Ende CH. Comparing patient characteristics and treatment processes in patients receiving physical therapy in the United States, Israel and the Netherlands: cross sectional analyses of data from three clinical databases. BMC Health Serv Res. 2008;8:163.
  16. Swinkels IC, van den Ende CH, de Bakker D. Clinical databases in physical therapy. Physiother Theory Pract. 2007;23(3):153–167. PMID: 17558879.
  17. Hays RD, Morales LS, Reise SP. Item response theory and health outcomes measurement in the 21st century. Med Care. 2000;38(9 Suppl):II28–II42. PMID: 10982088.
  18. Deutscher D, Hart DL, Stratford PW, Dickstein R. Construct validation of a knee-specific functional status measure: a comparative study between the United States and Israel. Phys Ther. 2011;91(7):1072–1084. PMID: 21596960.
  19. Hart DL, Cook KF, Mioduski JE, Teal CR, Crane PK. Simulated computerized adaptive test for patients with shoulder impairments was efficient and produced valid measures of function. J Clin Epidemiol. 2006;59(3):290–298. PMID: 16488360.
  20. Hart DL, Wang YC, Stratford PW, Mioduski JE. Computerized adaptive test for patients with knee impairments produced valid and responsive measures of function. J Clin Epidemiol. 2008;61(11):1113–1124. PMID: 18619788.
  21. Hart DL, Wang YC, Stratford PW, Mioduski JE. Computerized adaptive test for patients with foot or ankle impairments produced valid and responsive measures of function. Qual Life Res. 2008;17(8):1081–1091. PMID: 18709546.
  22. Hart DL, Wang YC, Stratford PW, Mioduski JE. A computerized adaptive test for patients with hip impairments produced valid and responsive measures of function. Arch Phys Med Rehabil. 2008;89(11):2129–2139. PMID: 18996242.
  23. Hart DL, Werneke MW, Wang YC, Stratford PW, Mioduski JE. Computerized adaptive test for patients with lumbar spine impairments produced valid and responsive measures of function. Spine (Phila Pa 1976). 2010;35(24):2157–2164. PMID: 20595928.
  24. Hays RD, Sherbourne CD, Mazel RM. The RAND 36-Item Health Survey 1.0. Health Econ. 1993;2(3):217–227. PMID: 8275167.
  25. Lord FM. Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum Associates; 1980.
  26. Reise SP, Bonifay WE, Haviland MG. Scoring and modeling psychological measures in the presence of multidimensionality. J Pers Assess. 2013;95(2):129–140. PMID: 23030794.
  27. R: A Language and Environment for Statistical Computing [R version 3.2.5] [computer program]. Vienna, Austria: R Foundation for Statistical Computing; 2016.
  28. Reise SP, Scheines R, Widaman KF, Haviland MG. Multidimensionality and structural coefficient bias in structural equation modeling: a bifactor perspective. Edu Psychol Meas. 2012;73(1):5–26.
  29. Mplus User’s Guide. Seventh Edition, version 7.4 [computer program]. Los Angeles, CA: Muthén & Muthén; 1998–2015.
  30. Reeve BB, Hays RD, Bjorner JB. Psychometric evaluation and calibration of health-related quality of life item banks: plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Med Care. 2007;45(5 Suppl 1):S22–S31. PMID: 17443115.
  31. Samejima F. Estimation of latent ability using a response pattern of graded scores. ETS Res Bull Ser. 1968:i–169.
  32. PARSCALE: IRT item analysis and test scoring for rating-scale data, version 4.1 [computer program]. Chicago, IL: Scientific Software International; 2003.
  33. Kosinski M, Bjorner JB, Ware JE Jr, Sullivan E, Straus WL. An evaluation of a patient-reported outcomes found computerized adaptive testing was efficient in assessing osteoarthritis impact. J Clin Epidemiol. 2006;59(7):715–723. PMID: 16765275.
  34. McHorney CA, Cohen AS. Equating health status measures with item response theory: illustrations with functional status items. Med Care. 2000;38(9 Suppl):II43–II59. PMID: 10982089.
  35. Dunnett CW. A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc. 1955;50(272):1096–1121.