Patient Preference and Adherence Volume 5, 2011 - Issue

Review

The problem with health measurement

Stefan J Cano, Clinical Neurology Research Group, Peninsula College of Medicine and Dentistry, Tamar Science Park, Plymouth, UK (corresponding author)

Jeremy C Hobart, Clinical Neurology Research Group, Peninsula College of Medicine and Dentistry, Tamar Science Park, Plymouth, UK

Pages 279-290 | Published online: 14 Jun 2011

Abstract

In this review we discuss health measurement with a focus on psychometric methods and methodology. In particular, we examine some of the key issues currently facing the use of clinician and patient rating scales to measure the health outcomes of disease and treatment. We present three key facts and flag one crucial problem. First, the numbers generated by scales are increasingly used as the measurements of the central dependent variables upon which clinical decisions are frequently made. The rising profile of rating scales has significant implications for scale construction, evaluation, and selection, as well as for interpreting studies. Second, rating scale science is well established. Therefore, it is important to learn the lessons from those who have built and established the science over the last century. Finally, the goal of a rating scale is to measure. As such, over the last half century, developments in rating scale (psychometric) methods have caused a refocus in the way we should be measuring health. In particular, newer methods have significant clinical advantages over traditional approaches. These should be seriously considered for inclusion in everyday practice. This leads us to the central problem with health measurement, which is that we cannot currently be sure what most rating scales are measuring. This is because the methods we have in place to ensure the validity of rating scales fall short of what is actually required. We expand on this point, and provide some potential routes forward to help address this important problem.

Keywords:

patient-reported outcome instruments
health-related quality of life
psychometrics
questionnaires
outcome assessment
health care

Introduction

Health measurement is increasingly at the heart of the agenda for high-stakes clinical research, trials, and practice,^Citation1^–^Citation3 which directly influences decisions about patient care and policy-making.^Citation4 This rise in profile has been accompanied by an increased interest in rating scale science.^{Citation2,Citation3} There are now growing numbers of clinical researchers who are either developing or using rating scales to quantify the effects of disease or treatment on abstract concepts, such as ability, emotional well-being, or memory. For example, the MAPI Trust, a nonprofit organization providing information on patient rating scales, houses over 3000 scales.^Citation5

Over the last 16 years we (SC, JH) have worked as health measurement researchers. We have been fortunate enough to have been involved in a wide range of clinical^{Citation6,Citation7} and surgical^{Citation8,Citation9} areas, have tested and developed a number of clinician-report^{Citation10,Citation11} and patient-report rating scales,^{Citation12,Citation13} and have used traditional and modern rating scale techniques.^Citation14 Our main interest lies in the science that underpins health measurement, also known as psychometrics.^Citation15 During our working careers, we have witnessed great progress relating to the application of psychometrics to the development of rating scales, and the development of documents containing key guidelines^{Citation16,Citation17} and high-level requirements.^{Citation2,Citation3}

However, we have also witnessed concerning problems in the field. Thus, despite the proliferation of rating scales in health measurement, many scales have not been psychometrically validated in an appropriate way.^Citation18^–^Citation22 This has wide-reaching effects. For example, despite the increased inclusion of rating scales in current “state-of-the-art” clinical research and trials, the same studies continue to use scales that have been proved to be scientifically wanting. This is demonstrated through even the most superficial of literature reviews, ie, a brief literature search in PubMed focusing on randomized controlled Phase III and IV trials in multiple sclerosis published in 2006–2011. This reveals that half of the 28 relevant articles used a rating scale, but only two articles include scales that have any supporting psychometric evidence. Parallels can be seen throughout neurology,^{Citation11,Citation23} and our experience working in other clinical disciplines suggests that these problems are not uncommon.

Given the increasing importance of rating scale data, we strongly believe that rating scales should provide scientifically robust results. However, the problem with health measurement runs deeper than psychometric “validation”. In order to understand why, we need to step back initially and provide some background and context. So, in this review, we explore health measurement, beginning with key concepts, followed by some important historical landmarks, then move on to the development and application of psychometric methods, finishing with some of the pressing issues of the current time. Health measurement covers a lot of ground. Of course it would be impossible to discuss all aspects of the area. So, before we get started, it is important to clarify what we will not be discussing here, but, given the omissions, why we believe our title is appropriate.

First, we do not include discussions on health economics, clinimetrics, or specific aspects of psychometric testing. In relation to health economics, the extent to which this falls under the remit of health measurement per se is debatable, but more importantly, this in itself is a large area that deserves its own review. For those interested in our views, we discuss health econometrics more fully elsewhere.^Citation9

In relation to clinimetrics, we would point readers to another of our publications, in which we provide a perspective on Feinstein’s contribution to the health measurement debate.^Citation23 For now, we would say that in this review we focus on the “measurement” part of health measurement. In particular, we discuss rating scales when they are used as measurement instruments to quantify variables of interest (eg, ability, depression, short-term memory) via patient self-report or clinician report. We do not discuss rating scales when they are used for other purposes, such as checklists, clinical assessment tools, methods of predicting outcome, structured interviews, or other methods for gathering information (eg, surveys). This is because terms such as evaluation, assessment, and measurement are often used interchangeably. However, measurement has a very specific meaning with respect to quantifying attributes (ie, a characteristic, or property belonging to a person).^Citation24 In contrast, evaluation and assessment are often qualitative processes.

Finally, we do not include a review (or appraisal) of specific psychometric tests, because once again this deserves its own review, given the size of the area and the issues. For those readers who would like to learn more, we have previously published a monograph that examines, in detail, the key tests used in traditional and modern psychometric methods.^Citation14

Why then, given that health measurement encompasses such a wide area, and has potentially many good and bad points, do we believe that our title is appropriate? In order to answer this question we must anticipate the punch line of our review. Thus, we believe that the cornerstones of health measurement are the instruments used to measure the target variables of interest. For these instruments to be fit for purpose they must provide clinically useful, meaningful, and interpretable data. We argue that, at the present time, the extent to which the vast majority of currently available scales achieve these vital criteria is unclear at best. This presents a “house of cards” situation, ie, if we are unclear as to the exact variables that our scales are measuring, what exactly can we do with the information they provide? We would suggest this fundamental issue has serious repercussions for the whole of health measurement. However, before we expand on this, we first need to revisit some key concepts to set the scene.

Key concepts

Rating scales are used to measure unobservable (latent) variables known as theoretical constructs, which are abstract (as opposed to concrete).^Citation25 Latent variables can be measured indirectly by asking questions intended to capture, empirically, the essential meaning of a construct. The simplest way to do this is to ask a single straightforward question, or item. However, single items are limited because they are: unlikely to represent the broad scope of a complex theoretical construct; likely to be interpreted in many different ways by respondents; imprecise because they cannot discriminate, to a fine degree, between different levels of an attribute; and unreliable (prone to random error) because they do not produce consistent answers over time.^Citation26 As such, rating scales are usually made up of multiple items, in which each item addresses a different aspect of the same underlying construct. Using multiple items overcomes the scientific limitations of single items because: more items increase the scope of a scale; are less open to variable interpretation; enable better precision; and improve reliability by allowing random errors of measurement to average out.^Citation26 In this review, we use the term “rating scale” as the umbrella term to cover any instrument that conforms to a questionnaire-style structure, and is used to obtain scores, from a person’s responses to statements or questions, which in turn are considered to be measurements of a given variable.
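The claim that adding items improves reliability can be illustrated with the Spearman-Brown formula from classical test theory. The formula is not cited in this review and is used here only as an illustration; the sketch assumes that the added items are parallel (equally reliable and measuring the same construct), and the function name is hypothetical.

```python
def spearman_brown(single_item_reliability: float, n_items: int) -> float:
    """Predicted reliability of a scale built from n parallel items."""
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    r = single_item_reliability
    return n_items * r / (1 + (n_items - 1) * r)


# Predicted reliability rises as items are added
# (assumed single-item reliability: 0.30).
for k in (1, 5, 10, 20):
    print(k, round(spearman_brown(0.30, k), 2))
```

Under these assumptions the gain from each extra item shrinks as the scale grows, which is one reason why very long scales offer diminishing returns.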

There are many methods, termed scaling models, for combining multiple items into scales, depending on the purpose the resulting scale is to serve.^Citation27^–^Citation31 The most widely used scaling model in health measurement is the method of summated ratings proposed by Likert.^{Citation32,Citation33} Four characteristics constitute a summated rating scale. First, there are multiple items whose scores are summed, without weighting, to generate a total score. Second, each item measures a property that can vary quantitatively. Third, each item has no right answer. Fourth, each item in the scale can be rated independently. Examples of Likert scales used in health measurement include the Medical Outcomes Study 36-item Short Form Health Survey (SF-36),^{Citation34,Citation35} General Health Questionnaire (GHQ),^Citation36 and the Hospital Anxiety and Depression Scale (HADS).^Citation37 The way in which developers propose that items should be combined to form a scale is called a measurement model. These models are the focus of a psychometric evaluation.
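The first characteristic of a summated rating scale, an unweighted sum of item scores, can be sketched in a few lines. The five-item scale and its 0 to 4 response coding below are hypothetical and do not correspond to any of the instruments named above.

```python
def summated_score(item_responses: list[int]) -> int:
    """Unweighted sum of item scores, as in the method of summated ratings."""
    if not item_responses:
        raise ValueError("at least one item response is required")
    return sum(item_responses)


# Hypothetical 5-item scale, each item scored 0 (never) to 4 (always).
responses = [3, 2, 4, 1, 3]
total = summated_score(responses)  # 13 out of a possible 20
```

Whether such a sum is legitimate as a measurement is exactly what a psychometric evaluation of the measurement model is meant to test.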

Rating scales in health measurement: a brief history

We have come a long way since Ernest Amory Codman’s “end result” idea.^Citation38 Codman was an orthopedic surgeon at the Massachusetts General Hospital, Boston, MA, during the first three decades of the 20th century.^Citation39 His “end result idea” entailed long-term follow-up of patients to determine treatment success, and taking steps to prevent new failures if outcomes were undesirable. Although Codman has been described as one of the most important figures in the history of clinical outcomes research, the conception and development of his “idea” have been largely neglected in the history of health measurement.^{Citation38,Citation39} It was not until after the Second World War that clinical researchers began to develop scales to measure the outcomes of procedures.

One of the first surgeons to do this was Visick, who attempted to measure the functional results of gastric surgery, focusing particularly on postprocedural complications.^Citation40 In 1949, Karnofsky, an oncologist, developed the first “performance” measure,^Citation41 ie, a 10-point observer-rated scale spanning the extremes of physical dependency defined by nursing burden. For many years, this scale was used widely, but often, it has been argued, inappropriately.^Citation42 It was improved 20 years later with Katz’s Activities of Daily Living Scale, which broadened the focus to wider aspects of quality of life.^Citation43 The same period saw an increase in the development and use of new scales across medicine, with the most noticeable increase in neurology.^Citation44 The decades following the 1960s witnessed increasing recognition of the importance of assessing a broader array of outcomes when measuring the impact of disease or evaluating the effectiveness of procedures.

During the 1970s, the focus of health care evaluation moved from traditional clinical outcomes (ie, mortality and morbidity) to the measurement of function (ie, the ability of patients to perform activities of daily living).^Citation25 The shift from traditional outcome measures to the wider encompassing measurement of health occurred for a number of reasons. First, the narrow definition of health in terms of morbidity and mortality was replaced by a broader definition of health as a “complete state of physical, mental and social well-being and not merely the absence of disease or infirmity”.^Citation45 Second, public health campaigns, rising standards of living, ageing populations, and development of health technology led to a shift in attention from the cure of acute diseases to the management of more complex, chronic conditions (eg, asthma, rheumatoid arthritis, multiple sclerosis). This led to increased interest in measuring more complex and subjective aspects of outcomes pertaining to the health impact of disease and/or treatment (for which we use the shorthand term “health outcomes” in this review). Third, there was increased demand for clinicians to demonstrate evidence of cost-effectiveness, in which the benefits of a particular health service or intervention are weighed against the costs of that service or intervention.^Citation46

The 1980s witnessed patient report rating scales (now known as Patient Reported Outcome [PRO] instruments) being increasingly used in clinical research, and as a result, phrases such as “quality of life” became buzz words.^Citation47 Scales for use across different clinical populations (generic measures) were developed and became widely used, including the Sickness Impact Profile,^Citation48 Nottingham Health Profile,^Citation49 and SF-36.^Citation50 The 1990s saw a proliferation of more targeted patient rating scales, including dimension-specific (eg, mood^Citation37), disease-specific (eg, cancer^Citation51), site-specific (eg, orthopedic^Citation52), and individualized scales.^Citation53 The gradual but important shift from clinical research to practice and policy^Citation2^–^Citation4 over the last decade has witnessed the proposal of even more sophisticated measuring instruments in the form of item banks.^Citation54^–^Citation56

Rating scales in health measurement: type and kind

Philosophically, the different types of rating scales can be classified into two distinct approaches.^{Citation57,Citation58} First, the standard needs approach describes measuring health outcomes as the extent to which certain universal needs are met. This approach advocates that there is a standard set of life circumstances that are required for optimal functioning. Although subjective phenomena, health outcomes are objective characteristics of an individual. Second, and in contrast, the psychological processes approach views health outcomes as being constructed from individual evaluations of personally salient aspects of life. This approach sees health outcomes as being made up of perception of life circumstances, dependent on the psychological makeup of an individual, rather than on their life circumstances alone. The central assumption of this approach is that each person is the best source of judgments about health outcomes, and one cannot assume that all people will value different circumstances in the same way.

Many types of rating scales can be classed as following the standard needs approach, ranging from generic scales that provide comprehensive, general evaluations of health outcomes, to those that concentrate on a specific aspect of health (eg, symptoms). The former is illustrated by the SF-36,^Citation50 which focuses on activities of daily living (eg, personal care, domestic roles, mobility) and on role functioning (eg, work, finance, family, friends, and social). Generic measures permit direct comparisons of different patient populations, thereby providing the opportunity to make policy decisions across a variety of diseases.^Citation59 The use of generic measures may enhance the generalizability of a study or help interpret results in a wider context. In addition, it can be argued that generic measures are likely to be robust because they are used and tested in many different settings. However, generic measures may be limited because they may be unable to address important aspects of outcome that are affected by a particular disease, and may not be sensitive enough to detect changes in outcome which occur in response to treatment or over time.^Citation60

There are three types of standard needs rating scales that concentrate on a more specific aspect of health, ie, disease/condition-specific, site-specific, and dimension-specific. The most commonly used of these scales are disease/condition-specific scales, which are developed for use in a specific disease or condition. These include items that are directly relevant to the condition and, therefore, are likely to be shorter and apparently more appropriate,^Citation59 which can help to reduce patient burden and increase acceptability.^Citation61 Disease-specific scales ensure more comprehensive assessment of important outcome domains, and are generally more sensitive in detecting the effects of treatment on outcome and changes in outcome over time.^Citation59

A site-specific scale focuses on health problems in a specific part of the body, such as the Oxford Hip Score.^Citation52 As with disease/condition-specific scales, these include fewer items and appear to be more appropriate, reducing patient burden and increasing acceptability.

A dimension-specific scale provides a comprehensive, general evaluation of one specific aspect of health, which may be applicable across different patient groups and treatments. Examples of these types of scale include the GHQ^Citation62 and HADS^Citation37 which focus on aspects of psychological well-being. The advantage of such measures is that they provide a more detailed assessment in the area of concern.

The main drawback of specific measures is that they do not allow comparisons between different patient groups. Therefore, it is argued that comprehensive assessment of outcome should include a combination of generic and specific measures.^{Citation59,Citation60} Generic measures allow comparisons across studies, thus enhancing the generalizability of findings, and specific measures provide better content validity, so are generally more responsive to measuring change due to greater relevance to the specific population.

In contrast to using generic or specific rating scales with predetermined content, proponents of the psychological processes approach argue that listing items in rating scales does not capture the subjectivity of human beings and the individual structure of values. In short, prescribing items using a preordained definition of health outcome (eg, quality of life) and matching the person to the definition (ie, “goodness of fit”), does not let us know whether all the domains, pertinent and meaningful to each respondent, are included. This viewpoint prompted the development of “individualized” measures, such as the Schedule for the Evaluation of Individual Quality Of Life (SEIQoL).^Citation53 The SEIQoL allows individuals to nominate important domains of quality of life and weight those domains in order of importance. Another, the Patient Generated Index (PGI), asks individuals to identify those aspects of life that are personally affected by health.^Citation63 The main advantage of these measures includes a claim for validity, given that the areas of importance are selected by the individuals involved in completing the measures. The main disadvantages are that some of these measures require trained interviewers, which translates into a need for greater resources and lower practicality. Also, it is less easy to compare data from individualized measures between patients due to the variation in each individual completed measure.^Citation64

Item banks can be viewed as very large “rating scales”, in which patients only complete a subset of targeted items. These banks capitalize on modern psychometric methods (which we describe more fully in the next section). In essence, modern methods provide rich information about item performance not available using traditional psychometric methods, that can be used to create banks of items (up to many hundreds or thousands of items) with known characteristics. New items can then be calibrated against the best available measures to obtain scales of higher quality and better precision.^Citation65 Item banking also makes it possible to carry out computer adaptive testing.^Citation66 In this technique, rather than giving the same set of items to each individual, the items are selected based on ability level or other characteristics. Computer adaptive testing has already been developed in many areas including migraine, combining datasets using different outcome measures.^Citation67
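The item selection step in computer adaptive testing can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any published system: it assumes that every item in the bank has already been calibrated on the same scale as the person's ability estimate, and it uses a simple rule (pick the unused item whose difficulty is closest to the current estimate), which under the Rasch model is the item that carries the most information about that person. The function name and the item identifiers are hypothetical.

```python
def select_next_item(ability, difficulties, administered):
    """Return the id of the unadministered item whose difficulty is closest to ability.

    ability      -- current estimate of the person's location
    difficulties -- dict mapping item id to calibrated item difficulty
    administered -- set of item ids already presented to this person
    """
    candidates = {k: b for k, b in difficulties.items() if k not in administered}
    if not candidates:
        raise ValueError("item bank exhausted")
    return min(candidates, key=lambda k: abs(candidates[k] - ability))


# Hypothetical three-item bank with difficulties on the same scale as ability.
bank = {"walk_indoors": -1.0, "climb_stairs": 0.5, "run_1km": 2.0}
first = select_next_item(0.3, bank, set())
second = select_next_item(0.3, bank, {first})
```

In a full adaptive test the ability estimate would be updated after each response and the selection repeated until a stopping rule (eg, a target standard error) is met; those steps are omitted here.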

As alluded to in this last paragraph, the increased application of rating scales in health measurement has required the introduction of more advanced psychometric methods. To elaborate on this, we first need to place these “newer” methods in context.

Psychometrics in health measurement: a brief history

Psychometrics was adopted as part of health measurement in the early 1980s.^Citation68^–^Citation70 However, its scientific foundations are deeply rooted in education and psychology. In fact, its origins can be traced to the mid 1800s when psychophysicists were demonstrating that subjective judgment can be used as a valid approach to measurement.^{Citation71,Citation72} Through the advent of the mental test movement (circa 1925–1960),^Citation30 these ideas were taken further and, as such, Thurstone proposed the “law of comparative judgment”, an approach with close connections to the psychophysical theory developed by Weber and Fechner. This demonstrated that psychophysical scaling methods could be used to measure psychological attributes accurately^{Citation27,Citation73} and prompted the development of psychological (or psychometric) scaling methods, which are defined as procedures for constructing scales for the measurement of psychological attributes.^Citation71 The mental test movement led to the widespread use of standardized tests (eg, educational achievement, attitudes and personality, personnel) and, at the same time, scientific interest in methods of testing led to the development of psychometrics as a prominent discipline in psychology, within which were established the cornerstones of the scientific evaluation of measures.^{Citation71,Citation74}

As explained above, since the 1970s health care evaluation has moved towards the measurement of physical, psychological, and social functioning.^Citation25 The importance of psychometric methods for measuring health variables was demonstrated by two related key studies conducted in the US. First, the Health Insurance Experiment^Citation75 showed that psychometric methods could be used to generate reliable and valid measures for assessing changes in health status for both adults and children in the general population. Second, the Medical Outcomes Study^{Citation25,Citation76} showed that psychometric methods of scale construction and data collection were successful for measuring health status in samples of sick and elderly people. Since then, the use of psychometrics has proliferated throughout health measurement.

Psychometric methods

The main psychometric approaches as related to health measurement have been classical test theory and, more recently, Rasch measurement models and item response theory. Of all three approaches, classical test theory is currently the dominant paradigm.

Classical test theory

Spearman laid down the foundations of classical test theory in 1904, when he introduced the decomposition of an observed score into a true score and an error, and showed how to estimate the reliability of observed scores.^Citation77 It took a further 50 years before the role of classical test theory analyses became clearer^Citation78 as an accumulation of statistical evidence to establish the scientific robustness of measures (eg, Kuder-Richardson’s coefficients for internal consistency, Cronbach’s alpha, correlations between replicated measurements). Classical test theory is grounded in the definition of measurement as proposed by Stevens (ie, “the assignment of numerals to objects or events according to some rule”).^Citation79 It is important to note that this definition differs in important respects from the more classical definition of measurement adopted throughout the physical sciences, which is that measurement is the numerical estimation and expression of the magnitude of one quantity relative to another.^Citation80 Classical test theory is based upon analyses of raw scores that are used to test the assumptions underlying a given measurement model, ie, that the items can be summed (without weighting or standardization) to produce a score. The key traditional measurement properties that should be considered are data quality, scaling assumptions, targeting, reliability, validity, and responsiveness. We and others describe these tests in more detail elsewhere.^{Citation2,Citation14}
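As an illustration of the kind of raw-score statistic used in classical test theory, Cronbach's alpha can be computed from a respondents-by-items table of scores. The sketch below uses the standard formula (number of items, sum of item variances, variance of total scores) with sample variances; the function name and the small dataset are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Variance of each item across respondents.
    item_variances = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Variance of the summed (total) scores across respondents.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: three respondents answering two items.
data = [[1, 1], [2, 3], [3, 2]]
alpha = cronbach_alpha(data)
```

Because the statistic is computed from the scores of a particular sample, its value changes with the sample, which is the sample dependence discussed later in this review.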

Rasch measurement methods

Georg Rasch, a Danish mathematician, was principally concerned with the measurement of individuals rather than distribution of levels of a trait in a population. He argued that the core requirement of social measurement should be the same as that in physical measurement (ie, “invariant comparison”). With this in mind, he developed the simple logistic model (now known as the Rasch model) and through applications in education and psychology, he was able to demonstrate that his approach met the stringent criteria for measurement used in the physical sciences.^Citation81 Vitally, the Rasch paradigm differs from the traditional statistical modeling paradigm, in that the latter approach is used to describe a set of data, whereas the former aims to obtain data which fit the model.^Citation82

In the Rasch model, the probability of a specified response to a given item (eg, “yes”/“no”) is modeled as a logistic function of the difference between the person and item parameter (ie, the higher a person’s ability with respect to the difficulty of an item, the higher the probability of a correct response). When applying the Rasch model, item locations are scaled first in a process known as “item calibration”. Once item locations are scaled, the person locations are measured on the same scale. Each item and person estimate has an associated standard error of measurement, which quantifies the associated degree of uncertainty.
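The logistic relationship described above can be written as a short function for the dichotomous case. The parameter names are assumptions introduced for illustration; both locations are taken to be on the same (logit) scale.

```python
import math


def rasch_probability(person_location: float, item_location: float) -> float:
    """Probability of the specified (eg, "yes") response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(person_location - item_location)))


# A person located exactly at the item's location has an even chance.
p_equal = rasch_probability(0.0, 0.0)
# A person one logit above the item is more likely than not to respond "yes".
p_above = rasch_probability(1.0, 0.0)
```

Only the difference between the two locations enters the function, so shifting both by the same amount leaves the probability unchanged.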

Rasch measurement methods are able to transform ordinal summed scores into linear measurements by paired comparisons of any two persons, any two items, or any one person and one item, defined by the logarithm of the relative probabilities.^{Citation81,Citation83,Citation84} Essentially, observed scores are replaced by the expected probabilities of occurrence, and relative differences are computed as ratios of the relative probabilities (as these are consistent indicators of relative differences). This ratio of the relative probabilities is then expressed on a linear scale in an additive form by taking logarithms. In addition, the Rasch model is able to transform summed scores into linear measures of persons and items that are on the same scale with a common unit, and freed up from the distributional properties of each other. Thus, the Rasch model realizes, mathematically, the requirements for scientific measurement of invariant comparisons of people, and items, on the same linear scale.^{Citation81,Citation83,Citation84}
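For the dichotomous case, the invariance described above can be written out as a short derivation. The symbols are assumptions introduced here for illustration: $\theta_n$ is the location of person $n$, $b_i$ is the location of item $i$, and $P_{ni}$ is the probability that person $n$ gives the specified response to item $i$.

```latex
\begin{align}
P_{ni} &= \frac{e^{\theta_n - b_i}}{1 + e^{\theta_n - b_i}}
  &&\text{(Rasch model)} \\
\ln\frac{P_{ni}}{1 - P_{ni}} &= \theta_n - b_i
  &&\text{(log odds are additive)} \\
\ln\frac{P_{ni}}{1 - P_{ni}} - \ln\frac{P_{mi}}{1 - P_{mi}}
  &= (\theta_n - b_i) - (\theta_m - b_i) = \theta_n - \theta_m
  &&\text{(comparison of persons $n$ and $m$)}
\end{align}
```

The item location cancels in the last line, so the comparison of two persons does not depend on which item is used. By the same argument, the comparison of two items does not depend on which person responds.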

Rasch measurement methods use the Rasch model to evaluate the legitimacy of summing items to generate measurements, and their reliability and validity. The model articulates the set of requirements that must be met for rating scale data to generate internally valid, equal-interval measurements that are stable (invariant) across items and people.^Citation85 The central tenet of the Rasch measurement methods is that they examine the extent to which observed data (patients’ actual responses to scale items) accord with (“fit”) predictions of those responses from a mathematical (Rasch) model. Thus, the difference between what should happen (expected) and what does happen (observed) indicates the extent to which rigorous measurement is achieved. Statistical and graphical tests are used to evaluate the correspondence of data with the model. Certain tests are global, while others focus on specific items or persons. There are seven key measurement properties that should be considered, ie, thresholds for item response options, item fit statistics, item locations, differential item functioning, correlations between standardized residuals, person separation index, and individual person change statistics. We describe these in more detail elsewhere.^Citation14
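The comparison of observed and expected responses can be sketched for a single dichotomous response. This is a simplified illustration, assuming known person and item locations and a 0/1 response; it is not the full set of fit statistics referred to above, and the function name is hypothetical.

```python
import math


def standardized_residual(observed: int, person_location: float, item_location: float) -> float:
    """Standardized difference between an observed 0/1 response and its Rasch expectation."""
    expected = 1.0 / (1.0 + math.exp(-(person_location - item_location)))
    variance = expected * (1.0 - expected)  # binomial variance of a 0/1 response
    return (observed - expected) / math.sqrt(variance)


# Expected pattern: a person well above the item answers "yes" (small residual).
unsurprising = standardized_residual(1, 2.0, -1.0)
# Unexpected pattern: a person well below the item answers "yes" (large residual).
surprising = standardized_residual(1, -2.0, 1.0)
```

Residuals near zero indicate responses close to what the model predicts; large positive or negative values flag responses that the model considers unlikely and that merit closer inspection.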

Comparison of classical test theory and Rasch measurement

Direct comparisons of classical test theory and Rasch measurement methods in the medical literature are sparse, and at best superficial.^{Citation86,Citation87} In part, this may be due to the fact that the two approaches cannot be compared easily, because they use different methods, produce different information, and apply different criteria for success and failure.

There are four main limitations of classical test theory. First, the data generated are ordinal rather than interval, the invariance of which is unknown.^Citation85 Second, scores for persons and samples are scale-dependent because they lack the provision for varying item parameters, resulting in item parameters that must be regarded as fixed.^Citation88 Third, scale properties, such as reliability and validity, are sample-dependent. As such, the marginal probabilities of measures (ie, the probability distribution of scale scores) vary across population subgroups, because these subgroups may vary in the rate of the construct being measured.^Citation11 Fourth, the data are only suitable for group studies, and are not suitable for individual patient measurement.^Citation89

Rasch measurement methods address each of the four limitations of classical test theory. First, the approach offers the ability to construct linear measurements from ordinal-level rating scale data, thereby addressing a major concern of using rating scales as outcome measures.^{Citation90,Citation91} Second, Rasch measurement methods provide item estimates that are free from the sample distribution and person estimates that are free from the scale distribution, thus allowing for greater flexibility in situations where different samples or test forms are used.^Citation92 Therefore, the methods allow for the use of subsets of items from each scale rather than all items from the scale, yet are still able to compare scores using different sets of items. This is the foundation for item banking and computerized adaptive testing.^Citation66 Third, Rasch measurement methods enable estimates to be obtained suitable for individual person analyses rather than only for group comparison studies.^{Citation84,Citation93}

Criticisms of the Rasch model include it being overly restrictive because it does not permit each item to have a different discrimination and because there is no provision in the model for other parameters (eg, guessing). Some also suggest that the model is limited by the need for unidimensional data and is too simple to match the complexity of human behavior. Further, the Rasch model is complex, and classical test theory scoring procedures are simpler to compute.^{Citation86,Citation94}^–^Citation96
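The parameters at issue in these criticisms can be made concrete with a short sketch. The code below is an illustrative assumption rather than any cited author's implementation: it writes the logistic item response function with optional discrimination and guessing parameters, so that the Rasch model is the special case with discrimination 1 and guessing 0. All item values are hypothetical.

```python
import math

def response_probability(theta, difficulty, discrimination=1.0, guessing=0.0):
    """Probability of a positive response under a logistic item response model.

    With the default discrimination (1.0) and guessing (0.0) this is the
    dichotomous Rasch model; other values give the more general models.
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return guessing + (1.0 - guessing) * logistic

person_locations = (-2.0, 2.0)  # one low and one high person location

# Rasch case: two hypothetical items that differ only in location.
rasch_easy = [response_probability(t, 0.0) for t in person_locations]
rasch_hard = [response_probability(t, 0.5) for t in person_locations]

# Same locations, but the items are now allowed unequal discriminations.
flat_item = [response_probability(t, 0.0, discrimination=0.5) for t in person_locations]
steep_item = [response_probability(t, 0.5, discrimination=2.0) for t in person_locations]

# A guessing parameter sets a floor on the probability for very low locations.
floor = response_probability(-10.0, 0.0, guessing=0.25)
```

With equal discriminations the easier item is more likely to be endorsed at every person location. With unequal discriminations the curves cross, so the relative ordering of the two items depends on who is being measured. The additional parameters may describe the data more closely, but the item ordering is no longer the same for all persons.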

Item response theory and Rasch measurement

Item response theory is another body of psychometric analysis that provides a foundation for statistical estimation of parameters that represent the locations of persons and items on a latent continuum.^Citation97 In particular, item response theory analyses are used to ascertain the degree to which a given model and parameter estimates can account for the structure of and statistical patterns within a response dataset.^{Citation82,Citation97} Rasch measurement methods and item response theory are mathematically similar and, therefore, are often considered as members of the same family of statistical techniques.^{Citation82,Citation98} This is inaccurate because practitioners of Rasch measurement methods and item response theory have different research agendas.^{Citation23,Citation82,Citation98}

The distinction between Rasch measurement methods and item response theory is subtle but important. Item response theory models are statistical models used to explain data, and as such, the aim of an item response theory analysis is to find the statistical model that best explains the observed data.^{Citation82,Citation98} When the observed data do not fit the chosen item response theory model, we seek another model to explain the data better. In contrast, Rasch measurement methods provide a mathematical model for guiding the construction of stable linear measures from rating scale data.^Citation81 Therefore, the aim of Rasch measurement methods is to determine the extent to which observed rating scale data satisfy the measurement model. When the data do not fit the model, we examine the data carefully to try to explain the misfit, but ultimately we choose data that satisfy the model’s requirements. This is the central tenet of the Rasch model that distinguishes it from item response theory models. Specifically, its defining property is its mathematical embodiment of the principle of invariant comparison.
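The principle of invariant comparison can be written out for the dichotomous Rasch model. The notation below is an assumption introduced only for illustration ($\beta_n$ for the location of person $n$, $\delta_i$ for the location of item $i$, and $X_{ni}$ for the scored response) and is not taken from the cited sources.

```latex
% Dichotomous Rasch model and its log-odds form
\[
P(X_{ni}=1) = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)},
\qquad
\ln\frac{P(X_{ni}=1)}{P(X_{ni}=0)} = \beta_n - \delta_i .
\]

% Comparing two items i and j for the same person n: the person parameter cancels
\[
\ln\frac{P(X_{ni}=1)}{P(X_{ni}=0)} - \ln\frac{P(X_{nj}=1)}{P(X_{nj}=0)}
= (\beta_n - \delta_i) - (\beta_n - \delta_j)
= \delta_j - \delta_i .
\]

% Comparing two persons n and m on the same item i: the item parameter cancels
\[
\ln\frac{P(X_{ni}=1)}{P(X_{ni}=0)} - \ln\frac{P(X_{mi}=1)}{P(X_{mi}=0)}
= (\beta_n - \delta_i) - (\beta_m - \delta_i)
= \beta_n - \beta_m .
\]
```

In this form the comparison of two items does not depend on which person is used to make it, and the comparison of two persons does not depend on which item is used. Adding an item-specific discrimination parameter multiplies each log-odds by a different constant, and the cancellation no longer holds.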

The above discussion raises two questions, ie, which approach is better and does it matter which approach is used? The answers to both questions depend on which central philosophy is followed, because this divides proponents of item response theory and Rasch measurement. Because item response theory prioritizes the observed data, it sees the Rasch perspective of using only one model as too restrictive, and the “selection” of data to meet that model as threatening to content validity.^{Citation99,Citation100} Because Rasch measurement prioritizes the mathematical model, it sees the process of modeling data as precluding the ability to achieve core requirements of measurement, too accepting of poor quality data, and threatening to construct validity. Not surprisingly, it has been suggested that item response theory and Rasch measurement have irreconcilable differences,^Citation101 and the two groups have come into conflict regarding which approach is preferable.^{Citation82,Citation102}^–^Citation104

Problem: our understanding of exactly what rating scales are measuring is limited

We hope that, in the previous sections, we have made the case for the strong scientific basis that underpins the area and the progress that has been made, especially over the last 50 years. We also hope that we have illustrated some of the potential pitfalls, especially in the selection of appropriate scales and use of appropriate psychometric methods. In fact, it is our experience that the most common disagreements in health measurement surround the issues of methods and methodology. We also expect that the debate surrounding the relative merits of competing psychometric approaches will continue. This is an issue for health measurement but, over time, and with enough discussion and clarification, we hope that this situation will improve. However, in our opinion, there is a more pressing and fundamental problem that needs to be addressed in health measurement.

The rise in profile of health measurement requires rating scales that measure the health constructs they purport to measure (ie, are valid), and health constructs that are clinically meaningful and interpretable. Unfortunately, the current methods of establishing rating scale validity rarely enable these goals to be confirmed, because they lack formal methods for defining and testing construct theories.^Citation105 This situation has arisen, in part, because the constructs measured by many scales are determined during their development.

Typically, scale developers generate a large pool of items, group them into potential scales, either statistically or thematically, decide what construct each group seems to measure, and then remove unwanted or irrelevant items. The main limitation of this approach is that the scale content, rather than the construct intended for measurement, defines what the scale measures. Neither grouping items statistically, nor thematically, ensures that the items in a group measure the same construct. Furthermore, neither method of grouping items adequately addresses the issues of defining, conceptualizing, and operationalizing constructs, which are central to valid measurement.^Citation106^–^Citation109 Even if the circumstances were different, and scales were underpinned by explicit construct theories, standard methods of validity testing would not enable those theories to be tested adequately. Why? Because current methods, which integrate evidence from nonstatistical and statistical tests, provide circumstantial evidence at best that a set of items is measuring a specific construct.

Nonstatistical tests of validity typically consist of assessments of content and face validity. Content validation assesses whether scale development has sampled all the relevant or important content or domains,^Citation110 uses “sensible methods of scale construction”, and a “representative collection of items”.^Citation111 Face validation assesses whether the final scale looks, on the face of it,^Citation110 like it measures what is intended.^Citation111 Over 50 years ago, Guilford named these evaluations “validity by assumption” and “faith validity”,^Citation71 yet they remain essentially unchallenged, except, perhaps, for Feinstein’s contribution of clinimetrics.^Citation24

Statistical tests of scale validity are more formal than their nonstatistical counterparts, but remain weak evaluations of the extent to which a set of items measures a construct. For example, statistical examinations of internal construct validity (eg, factorial validity^Citation112 and internal consistency^Citation113) test the extent to which the items of a scale are related statistically. This does not confirm that a set of items marks out a clinically meaningful variable of interest, let alone tell us what a scale measures.
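As an illustration of why internal consistency says nothing about what is being measured, the sketch below computes Cronbach's alpha, a widely used internal consistency coefficient, from a small set of hypothetical responses. The data and the function name are assumptions for illustration only.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a persons-by-items table of scores.

    responses: list of rows, one row per person, one score per item.
    """
    n_items = len(responses[0])
    item_variances = [variance([row[i] for row in responses]) for i in range(n_items)]
    total_variance = variance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical scores of five people on three items (1 to 5 response options).
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 4, 4],
]
alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Nothing in the calculation refers to the wording or content of the items. Any set of items whose scores covary strongly yields a high coefficient, whatever construct, if any, they share.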

Statistical tests of external construct validity consist of a range of examinations (including correlations with other measures,^{Citation114,Citation115} testing known group differences,^Citation116 and hypothesis testing^{Citation113,Citation114}) which assess the extent to which scale scores “behave” as predicted, and seek to determine if a scale “does what it is intended to do”.^Citation74 The examination considered to provide the strongest statistical evidence of scale validity is called convergent and discriminant construct validity.^Citation115 Here, a range of scales measuring similar and dissimilar constructs are administered to a sample. Their scores are correlated, and the pattern and magnitude of correlations are examined to determine if the scale being validated correlates better with scales measuring similar constructs than dissimilar constructs. The limitation of this approach is that showing a scale does not correlate highly with measures of a dissimilar construct tells us nothing about what the scale actually measures. Similarly, showing that a scale correlates highly with measures of similar constructs only tells us that the two are related.
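The convergent and discriminant procedure described above can be sketched in a few lines. The scores below are hypothetical and the variable names are assumptions; the sketch only shows the mechanics of the comparison, not a recommended analysis.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical total scores for the same six people on three scales.
new_scale = [10, 12, 15, 18, 20, 25]       # scale being validated
similar_scale = [11, 14, 14, 19, 22, 24]   # intended to measure a similar construct
dissimilar_scale = [5, 9, 4, 8, 6, 7]      # intended to measure a dissimilar construct

convergent_r = pearson_r(new_scale, similar_scale)
discriminant_r = pearson_r(new_scale, dissimilar_scale)
print(f"convergent r = {convergent_r:.2f}, discriminant r = {discriminant_r:.2f}")
```

The expected pattern appears (a high convergent and a low discriminant correlation), yet the calculation uses only person total scores. It shows that two sets of scores rise and fall together across people, not what either scale measures.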

A key problem with all statistical tests of validity is that they focus on person scores and between-person variation in these scores. They are weak because there is no independent means of assessing the extent to which the intention of the scale is attained.^Citation117 Consequently, these validation techniques entail circular reasoning,^Citation117 generate only circumstantial evidence of validity,^Citation98 enable limited development of construct theories, and result in “primitive” understandings of exactly what is being measured.^Citation105 Like their nonstatistical counterparts, they have remained essentially unchallenged for decades.

Can we solve the problem?

Encouragingly, PRO guidelines, such as the current scientific requirements of the US Food and Drug Administration (FDA) for patient-reported rating scales in clinical trials,^{Citation2,Citation118} highlight the importance of establishing validity. In particular, the FDA emphasizes appropriate conceptual frameworks and definitions as being fundamental. However, the FDA document provides little detailed guidance on how these can be achieved, largely because the field is poorly developed. We would argue that greater use of qualitative assessments is vital, and should include evaluating the extent to which the items of a scale map out the construct to be measured, establishing the most appropriate item phrasing, structuring and context, and cognitive debriefing to ensure consistency in meaning. In particular, we advocate the use of inductive and deductive approaches to develop explicit theories of the constructs being measured, and explicit methods of testing those theories.^{Citation105,Citation117,Citation119}

Rating scale development would benefit from being “bottom-up” (from a construct definition), rather than “top-down” (from a method of grouping items) to ensure that a substantive construct theory determines scale content, and validation tests construct theories. This would require the development of robust guidelines for defining constructs and explicit definitions for content and face validity. Rating scale evaluation should fully acknowledge the equally important and complementary roles of qualitative and quantitative evaluations. In fact, scale evaluation could be considered under these two headings. The aim of qualitative evaluation could be defined as determining the extent to which the items of a scale map out a construct as a clinically meaningful continuum and, when available, the extent to which construct theory is supported. The aim of quantitative evaluation could be defined as determining the extent to which the numbers generated by scales are measurements rather than numerals.

This analysis of scale validity implies that two things are needed, ie, explicit theories of the constructs being measured and explicit methods of testing those theories. Over the last 25 years, one group outside of health measurement has developed these ideas to an advanced level.^{Citation105,Citation117,Citation119} This group, led by Stenner, has argued for a change in focus of assessing validity from studying the people to the items,^Citation105 and in particular the relationships between item characteristics and item scores. These relationships form the building blocks of the theory of the construct, and the validity of the construct theory becomes established when it predicts variation in item scale values. Stenner asks three key questions: Why are items ordered in a particular way? How can we explain variation in item scores (ie, item difficulty)? What is the “something” that causes this variation?

The approach of Stenner et al is illustrated by their Lexile framework for measuring people’s reading ability.^Citation119 The reading ability continuum is mapped out by a set of items, each of which is a passage of reading text with different levels of readability (reading difficulty). People’s responses to the items are scored to give a measure of their reading ability. The Lexile framework was constructed using Rasch measurement methods, thus people are measured in linear units (called Lexiles), and legitimate individual person measurement is possible. Theory suggests that the reading difficulty of a passage of text (item difficulty) is determined by two characteristics, ie, the frequency of the words as they are used in everyday written and oral communications, and the length of the sentences. These two variables combine in the form of a construct specification equation that consistently explains more than 80% of the variation in text difficulty.^Citation119 Thus, empirical evidence strongly supports the construct theory. Stenner calls this approach “theory-referenced measurement”.^Citation119 We provide more detail about his work elsewhere.^Citation23
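The idea of a construct specification equation can be sketched as an ordinary least squares regression of item difficulties on item characteristics. Everything in the example below is an assumption made for illustration: the six items, their two characteristics (labelled word frequency and sentence length to echo the description above), and their difficulties are invented, and the fitting routine is a generic one rather than the procedure used in the Lexile framework.

```python
def fit_ols(rows, y):
    """Ordinary least squares with an intercept, solved via the normal equations."""
    X = [[1.0, *row] for row in rows]
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(xi[r] * xi[c] for xi in X) for c in range(p)]
         + [sum(xi[r] * yi for xi, yi in zip(X, y))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[r][p] / A[r][r] for r in range(p)]

def r_squared(rows, y, coefs):
    """Proportion of variance in y explained by the fitted equation."""
    preds = [coefs[0] + sum(c * x for c, x in zip(coefs[1:], row)) for row in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical items: (word frequency index, sentence length index).
item_features = [(3.0, 1.0), (2.5, 1.5), (2.0, 1.2), (1.5, 2.0), (1.0, 2.2), (0.5, 2.8)]
# Hypothetical item difficulties, eg, as estimated from a Rasch analysis.
item_difficulties = [-1.8, 0.3, 0.9, 3.8, 5.9, 8.2]

coefficients = fit_ols(item_features, item_difficulties)
explained = r_squared(item_features, item_difficulties, coefficients)
print(f"intercept and weights: {coefficients}; variance explained: {explained:.2f}")
```

The quantity of interest is the proportion of variation in item difficulty that the theory-derived characteristics explain. The dependent variable is the item location, not the person score, which is the shift of focus from people to items that this approach argues for.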

There are currently no examples of scales developed using theory-referenced measurement in health measurement, but it would not be hard to imagine instances where we could apply this approach. One example could be measuring the impact of disability. We would argue that it should be possible to take any aspect of impact (eg, upper limb functioning), and ask the same questions as Stenner’s group. Thus, why are upper limb physical functioning items ordered and separated as they are? What specific item characteristics (eg, task variables) determine item difficulties (eg, task abilities)? We could identify the motor components of tasks that may characterize a theory of upper limb functioning, and examine items to identify their characteristics (variables) that account for these task difficulties. In doing so, we would begin to assemble the building blocks of a new construct theory and then move towards an appropriate construct specification equation.

Conclusion

In a 1997 editorial, Sonja Hunt, codeveloper of one of the first generic measures, ie, the Nottingham Health Profile,^Citation49 warns us about the dangers of using quality of life instruments for decision-making: “From the perspective of scientific method it seems that there is a considerable way to go before any of the existing models or ‘theories’ can be considered definitive enough to justify application in the lives of patients ... where the results may be used to guide decision-making in the real world is not only unscientific, it is unethical”.^Citation47

Fourteen years later, we find ourselves in a position where the field now stretches far beyond quality of life, into all aspects of health, and clinician-report and patient-report rating scales are being used as part of the patient decision-making process. However, in terms of the application of scientific methods to ensure that we have a clear understanding of what we are measuring, much less progress has been made. Thus, whereas we feel the intention behind the use of rating scales as health measurement tools in high stakes decision-making is well meant, we believe that there is a way to go before we can be confident that these tools are providing accurate information about their target constructs. The potential consequences in terms of rating scales misguiding patient care and misleading research, we believe, are under-appreciated by clinicians and researchers.

Although construct specification equations are some way off, a move towards developing consensus guidelines to strengthen the theoretical underpinnings of new scales and the evaluation of existing scales would benefit health measurement. In particular, we would like to see greater use of qualitative assessments including: the adoption of inductive and deductive approaches to construct theory building and development; evaluations of the extent to which the items of a scale mark out the construct to be measured; establishing the most appropriate item phrasing, structuring, and context; and cognitive debriefing to ensure consistency in meaning.

We have two key messages from our review. First, clinical researchers should be aware that a wealth of information regarding psychometrics is available. However, considered in isolation, psychometric statistics can be misleading. They cannot be expected to produce consistently meaningful results when considered apart from qualitative scale content evaluations. Second, establishing clinically meaningful content validity from the outset by defining, conceptualizing, and operationalizing the constructs intended to be measured is a vital step. Unfortunately, in health measurement, such strong conceptual underpinnings and therefore explicit construct theories are uncommon,^Citation47 and clinicians, researchers, and policy makers should bear this in mind when engaging with health measurement at all levels. Stenner et al use the following analogy to describe a construct theory: “The story we tell about what it means to move up and down the scale for a variable of interest (eg, temperature, reading ability, short-term memory). Why is it, for example, that items are ordered as they are on the item map? [This] story evolves as knowledge increases regarding the construct”.^Citation119 We would suggest that we need to be able to tell clearer and more detailed stories about what underpins our rating scales before we can start to use them confidently to make decisions about patients’ lives.

Disclosure

The authors report no conflicts of interest in this work.

References

Darzi A. High Quality Care for All: NHS Next Stage Review Final Report. London, UK: Department of Health; 2008.
Food and Drug Administration. Patient reported outcome measures: Use in medical product development to support labelling claims. Available from: www.fda.gov/cber/gdlns/prolbl.pdf. Accessed May 17, 2011.
Food and Drug Administration. Qualification process for drug development tools. Available from: http://www.fda.gov/cder/guidance/index.htm. Accessed May 17, 2011.
Department of Health. Equity and Excellence: Liberating the NHS. London, UK: Her Majesty’s Stationery Office; 2010.
MAPI Trust. Available from: http://www.mapi-trust.org/about-the-trust. Accessed May 17, 2011.
Hobart J, Lamping D, Thompson A. Evaluating neurological outcome measures: The bare essentials. J Neurol Neurosurg Psychiatry. 1996;60:127–130. PMID: 8708638.
Hobart J, Freeman J, Thompson A. Kurtzke scales revisited: The application of psychometric methods to clinical intuition. Brain. 2000;123:1027–1040. PMID: 10775547.
Cano S, Klassen A, Pusic A. The science behind quality-of-life measurement: A primer for plastic surgeons. Plast Reconstr Surg. 2009;123:98e–106e. PMID: 19116542.
Cano S, Klassen A, Scott A, Thoma A, Feeny D, Pusic A. Health outcome and economic measurement in breast cancer surgery: Challenges and opportunities. Expert Rev Pharmacoecon Outcomes Res. 2010;10:583–594. PMID: 20950073.
Cano S, Hobart J, Hart P, Kolipara L, Schapira A, Cooper J. The International Co-operative Ataxia Rating Scale (ICARS): An appropriate rating scale for Friedreich’s ataxia. Mov Disord. 2005;20:1585–1591. PMID: 16114019.
Cano S, Posner H, Moline M. The ADAS-cog in Alzheimer’s disease clinical trials: Psychometric evaluation of the sum and its parts. J Neurol Neurosurg Psychiatry. 2010;81:1363–1368. PMID: 20881017.
Hobart J, Lamping D, Fitzpatrick R, Riazi A, Thompson A. The Multiple Sclerosis Impact Scale (MSIS-29): A new patient-based outcome measure. Brain. 2001;124:962–973. PMID: 11335698.
Cano S, Browne J, Lamping D, Roberts A, McGrouther D, Black N. The Patient Outcomes of Surgery-Head/Neck (POS-Head/Neck): A new patient-based outcome measure. J Plast Reconstr Aesthet Surg. 2006;59:65–73. PMID: 16482791.
Hobart J, Cano S. Improving the evaluation of therapeutic intervention in MS: The role of new psychometric methods. Health Technol Assess. 2009;13:1–200.
Streiner D, Norman G. Health Measurement Scales: A Practical Guide to their Development and Use. 4th ed. Oxford, UK: Oxford University Press; 2008.
Scientific Advisory Committee of the Medical Outcomes Trust. Assessing health status and quality of life instruments: Attributes and review criteria. Qual Life Res. 2002;11:193–205. PMID: 12074258.
Mokkink L, Terwee C, Patrick D. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: An international Delphi study. Qual Life Res. 2010;19:539–549. PMID: 20169472.
Cano S, Browne J, Lamping D. Patient-based measures of outcome in plastic surgery: Current approaches and future directions. Br J Plast Surg. 2004;57:1–11. PMID: 14672672.
Cano S, Hobart J, Linacre J. Patient-based outcomes of cervical dystonia: A review of rating scales. Mov Disord. 2004;19:1054–1059. PMID: 15372595.
Pusic A, Liu J, Chen C. A systematic review of patient-reported outcome measures in head and neck cancer surgery. Otolaryngol Head Neck Surg. 2007;136:525–535. PMID: 17418246.
Kosowski T, McCarthy C, Reavey P. A systematic review of patient-reported outcome measures after facial cosmetic surgery and/or nonsurgical facial rejuvenation. Plast Reconstr Surg. 2009;123:1819–1827. PMID: 19483584.
Chen C, Cano S, Klassen A. Measuring quality of life in oncologic breast surgery: A systematic review of patient-reported outcome measures. Breast J. 2010;16:587–597. PMID: 21070435.
Hobart J, Cano S, Zajicek J, Thompson A. Rating scales as outcome measures for clinical trials in neurology: Problems, solutions, and recommendations. Lancet Neurol. 2007;6:1094–1105. PMID: 18031706.
Feinstein A. Clinimetrics. New Haven, CT: Yale University Press; 1987.
Stewart A, Ware J. Measuring Functioning and Well-being: The Medical Outcomes Study Approach. Durham, NC: Duke University Press; 1992.
Nunnally J. Psychometric Theory. 2nd ed. New York, NY: McGraw-Hill; 1978.
Thurstone L. A method for scaling psychological and educational tests. J Educ Psychol. 1925;16:433–451.
Guttman L. A basis for analysing test-retest reliability. Psychometrika. 1945;10:255–282. PMID: 21007983.
Gulliksen H. Theory of Mental Tests. New York, NY: Wiley; 1950.
Torgerson W. Theory and Methods of Scaling. New York, NY: John Wiley and Sons; 1958.
Edwards A. Techniques of Attitude Scale Construction. New York, NY: Appleton-Century-Crofts; 1957.
Likert R. A technique for the measurement of attitudes. Arch Psychol. 1932;140:5–55.
Likert R, Roslow S, Murphy G. A simple and reliable method of scoring the Thurstone attitude scales. J Soc Psychol. 1934;5:228–238.
Ware J, Snow K, Kosinski M, Gandek B. SF-36 Health Survey Manual and Interpretation Guide. Boston, MA: Nimrod Press; 1993.
Ware J, Kosinski M, Keller S. SF-36 Physical and Mental Health Summary Scales: A User’s Manual. Boston, MA: The Health Institute, New England Medical Center; 1994.
Goldberg D. Manual of the General Health Questionnaire. Windsor, UK: NFER-Nelson; 1978.
Zigmond A, Snaith R. The Hospital Anxiety and Depression Scale. Acta Psychiatr Scand. 1983;67:361–370. PMID: 6880820.
Kaska S, Weinstein J. Historical perspective. Ernest Amory Codman, 1869–1940. A pioneer of evidence-based medicine: The end result idea. Spine. 1998;23:629–633. PMID: 9530796.
Neuhauser D. Ernest Amory Codman, M.D., and end results of medical care. Int J Technol Assess Health Care. 1990;6:307–325. PMID: 2203705.
Visick A. A study of the failures after gastectomy. Ann R Coll Surg Engl. 1948;3:266–284. PMID: 18112082.
Karnofsky D, Abelmann W, Craver L, Burchenal J. The use of nitrogen mustards in the treatment of carcinoma. Cancer. 1948;1:634–656.
Fraser S. Quality-of-life measurement in surgical practice. Br J Surg. 1993;80:163–169. PMID: 8443640.
Katz S, Downs T, Cash H, Grotz R. Progress in development of the index of ADL. Gerontologist. 1976;10:20–30. PMID: 5420677.
Herndon R. Handbook of Neurologic Rating Scales. New York, NY: Demos Medical Publishing; 2006.
World Health Organisation. Constitution of the World Health Organisation. Geneva, Switzerland: World Health Organisation; 1948.
Robinson R. The policy context. Br Med J. 1993;307:994–996. PMID: 8241915.
Hunt SM. The problem of quality of life. Qual Life Res. 1997;6:205–212. PMID: 9226977.
Bergner M, Bobbitt R, Pollard W, Martin D, Gilson B. The Sickness Impact Profile: Validation of a health status measure. Med Care. 1976;14:57–67. PMID: 950811.
Hunt S, McEwen J, McKenna S. Measuring Health Status. London, UK: Croom Helm; 1985.
Ware J, Sherbourne D. The MOS 36-Item Short-Form Health Survey (SF-36): I. Conceptual framework and item selection. Med Care. 1992;30:473–483. PMID: 1593914.
Aaronson N, Ahmedzai S, Bergman B. The European Organization for Research and Treatment of Cancer QLQ-C30: A quality-of-life instrument for use in international clinical trials in oncology. J Natl Cancer Inst. 1993;85:365–376. PMID: 8433390.
Dawson J, Fitzpatrick R, Murray D, Carr A. Comparison of measures to assess outcomes in total hip replacement surgery. Qual Health Care. 1996;5:81–88. PMID: 10158596.
O’Boyle C, McGee H, Hickey A, Joyce C, Browne J, O’Malley K. The Schedule for the Evaluation of Individual Quality of Life (SEIQoL): Administration Manual. Dublin, Ireland: Royal College of Surgeons in Ireland; 1993.
Revicki D, Cella D. Health status assessment for the twenty-first century: Item response theory, item banking and computer adaptive testing. Qual Life Res. 1997;6:595–600. PMID: 9330558.
Fries JF, Cella D, Rose M, Krishnan E, Bruce B. Progress in assessing physical function in arthritis: PROMIS short forms and computerized adaptive testing. J Rheumatol. 2009;36:2061–2066. PMID: 19738214.
Garcia S, Cella D, Clauser SB. Standardizing patient-reported outcomes assessment in cancer clinical trials: A patient-reported outcomes measurement information system initiative. J Clin Oncol. 2007;25:5106–5112. PMID: 17991929.
Browne J, McGee H, O’Boyle C. Conceptual approaches to the assessment of quality of life. Psychol Health. 1997;12:737–751.
Bowling A. Measuring Health: A Review of Quality of Life Measurement Scales. 3rd ed. Milton Keynes, UK: Open University Press; 2005.
Bergner M. Health status measures: An overview and guide for selection. Annu Rev Public Health. 1987;8:191–210. PMID: 3555521.
Patrick D, Deyo R. Generic and disease-specific measures in assessing health status and quality of life. Med Care. 1989;27(3 Suppl):S217–S232. PMID: 2646490.
Fletcher A, Gore S, Jones D, Fitzpatrick R, Spiegelhalter D, Cox D. Quality of life measures in health care. II: Design, analysis, and interpretation. Br Med J. 1992;305:1145–1148. PMID: 1463954.
Goldberg D, Hillier V. A scaled version of the General Health Questionnaire. Psychol Med. 1979;9:139–145. PMID: 424481.
Ruta D, Garratt A, Leng M, Russell I, MacDonald L. A new approach to measurement of quality of life: The patient-generated index. Med Care. 1994;32:1109–1126. PMID: 7967852.
Fitzpatrick R, Davey C, Buxton MJ, Jones DR. Evaluating patient-based outcome measures for use in clinical trials. Health Technol Assess. 1998;2:1–74. PMID: 10103353.
Choppin B. An item bank using sample free calibration. Nature. 1968;219:870–872. PMID: 5673356.
Linacre J. Computer-adaptive testing: A methodology whose time has come. In: Chae S, Kang U, Jeon E, Linacre J. Development of Computerised Middle School Achievement Tests. Seoul, Korea: Komesa Press; 2000.
Ware J, Bjorner J, Kosinski M. Practical implications of item response theory and computer adaptive testing. A brief summary of ongoing studies of widely used headache impact scales. Med Care. 2000;38:73–82.
Ware J, Brook R, Davies-Avery A. Conceptualization and Measurement of Health for Adults in the Health Insurance Study: Volume I, Model of Health and Methodology. Santa Monica, CA: The Rand Corporation; 1980.
McDowell I, Newell C. Measuring Health: A Guide to Rating Scales and Questionnaires. 1st ed. Oxford, UK: Oxford University Press; 1987.
Streiner D, Norman G. Health Measurement Scales: A Practical Guide to Their Development and Use. 1st ed. Oxford, UK: Oxford University Press; 1989.
Guilford J. Psychometric Methods. 2nd ed. New York, NY: McGraw-Hill; 1954.
Nunnally J. Tests and Measurements: Assessment and Prediction. New York, NY: McGraw-Hill; 1959.
Thurstone L. Fechner’s law and the method of equal-appearing intervals. J Exp Psychol. 1929;12:214.
Nunnally J. Psychometric Theory. 1st ed. New York, NY: McGraw-Hill; 1967.
Brook R, Ware J, Davies-Avery A. Conceptualization and Measurement of Health for Adults in the Health Insurance Study: Volume VIII, Overview. Santa Monica, CA: The Rand Corporation; 1979.
Stewart A, Greenfield S, Hays R. Functional status and well-being of patients with chronic conditions. Results from the Medical Outcomes Study. J Am Med Assoc. 1989;262:907–913.
Spearman C. The proof and measurement of association between two things. Am J Psychol. 1904;15:72–101.
Novick M. The axioms and principal results of classical test theory. J Math Psychol. 1966;3:1–18.
Stevens S. On the theory of scales of measurement. Science. 1946;103:677–680.
Michell J. Measurement scales and statistics: A clash of paradigms. Psychol Bull. 1986;100:398–407.
Rasch G. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen, Denmark: Danish Institute for Education Research; 1960.
Andrich D. Controversy and the Rasch model: A characteristic of incompatible paradigms. Med Care. 2004;42:I7–I16. PMID: 14707751.
Wright B, Stone M. Best Test Design: Rasch Measurement. Chicago, IL: MESA College Press; 1979.
Andrich D. Rasch Models for Measurement. Beverly Hills, CA: Sage Publications; 1988.
Wright B, Linacre J. Observations are always ordinal: Measurements, however must be interval. Arch Phys Med Rehabil. 1989;70:857–860. PMID: 2818162.
McHorney C, Haley S, Ware J. Evaluation of the MOS SF-36 Physical Functioning Scale (PF-10): II. Comparison of relative precision using Likert and Rasch scoring methods. J Clin Epidemiol. 1997;50:451–461. PMID: 9179104.
Prieto L, Alonso J, Lamarca R. Classical test theory versus Rasch analysis for quality of life questionnaire reduction. Health Qual Life Outcomes. 2003;1:27. PMID: 12952544.
Embretson S, Hershberger S. The New Rules of Measurement. Mahwah, NJ: Lawrence Erlbaum Associates; 1999.
McHorney C, Tarlov A. Individual-patient monitoring in clinical practice: Are available health status surveys adequate. Qual Life Res. 1995;4:293–307. PMID: 7550178.
Whitaker J, McFarland H, Rudge P, Reingold S. Outcomes assessment in multiple sclerosis trials: A critical analysis. Mult Scler. 1995;1:37–47. PMID: 9345468.
Platz T, Eickhof C, Nuyens G, Vuadens P. Clinical scales for the assessment of spasticity, associated phenomena, and function: A systematic review of the literature. Disabil Rehabil. 2005;27:7–18. PMID: 15799141.
Wright B, Masters G. Rating Scale Analysis: Rasch Measurement. Chicago, IL: MESA College Press; 1982.
Wright B. Solving measurement problems with the Rasch model. J Educ Meas. 1977;14:97–116.
Lord F. Applications of Item Response Theory to Practical Testing. Mahwah, NJ: Lawrence Erlbaum Associates; 1980.
Hambleton R. Fundamentals of Item Response Theory. London, UK: Sage Publications; 1991.
Norquist J, Fitzpatrick R, Dawson J, Jenkinson C. Comparing alternative Rasch-based methods vs raw scores in measuring change in health. Med Care. 2004;42:I25–I36. PMID: 14707753.
Lord F, Novick M. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley; 1968.
Massof R. The measurement of vision disability. Optom Vis Sci. 2002;79:516–552. PMID: 12199545.
Cook K, Monahan P, McHorney C. Delicate balance between theory and practice. Med Care. 2003;41:571–574. PMID: 12719678.
Fisher W. The Rasch debate: Validity and revolution in education measurement. In: Wilson M. Objective Measurement: Theory into Practice. Norwood, NJ: Ablex; 1992.
Goldstein H. Consequences of using the Rasch model for educational assessment. Br Educ Res J. 1979;5:211–220.
Wright B. Misunderstanding the Rasch model. J Educ Meas. 1977;14:219–225.
Divgi D. Does the Rasch model really work for multiple choice items? Not if you look closely. J Educ Meas. 1986;23:283–298.
Goldstein H, Wood R. Five decades of item response modelling. Br J Math Stat Psychol. 1989;42:139–167.
Stenner A, Smith M. Testing construct theories. Percept Mot Skills. 1982;55:415–426.
Nicholl L, Hobart J, Cramp A, Lowe-Strong A. Measuring quality of life in multiple sclerosis: Not as simple as it sounds. Mult Scler. 2005;11:708–712. PMID: 16320732.
Andrich D. A framework relating outcomes based education and the taxonomy of educational objectives. Stud Educ Eval. 2002;28:35–59.
Andrich D. Implication and applications of modern test theory in the context of outcomes based education. Stud Educ Eval. 2002;28:103–121.
Hobart J, Riazi A, Thompson A. Getting the measure of spasticity in multiple sclerosis: The Multiple Sclerosis Spasticity Scale (MSSS-88). Brain. 2006;129:224–234. PMID: 16280352.
PubMed Web of Science ®Google Scholar
StreinerDNormanGHealth Measurement Scales: A Practical Guide to Their Development and Use2nd edOxford, UKOxford University Press1995
Google Scholar
NunnallyJIntroduction to Psychological MeasurementNew York, NYMcGraw-Hill1970
Google Scholar
MaurischatCEhlebracht-KonigIKuhnABullingerMFactorial validity and norm data comparison of the Short Form 12 in patients with inflammatory-rheumatic diseaseRheumatol Int20062661462116179999
PubMed Web of Science ®Google Scholar
BohrnstedtGMeasurementRossiPWrightJAndersonAHandbook of Survey ResearchNew York, NYAcademic Press1983
Google Scholar
CronbachLMeehlPConstruct validity in psychological testsPsychol Bull19555228130213245896
PubMed Web of Science ®Google Scholar
CampbellDTFiskeDWConvergent and discriminant validation by the multitrait-multimethod matrixPsychol Bull1959568110513634291
PubMed Web of Science ®Google Scholar
KerlingerFNFoundations of Behavioural Research2nd edNew York, NYHolt, Rinehart and Winston1973
Google Scholar
StennerASmithMBurdickDTowards a theory of construct definitionJ Educ Meas198320305316
Web of Science ®Google Scholar
RevickiDFDA draft guidance and health-outcomes researchLancet200736954054217307086
PubMed Web of Science ®Google Scholar
StennerABurdickHSandfordEBurdickDHow accurate are Lexile text measuresJ Appl Meas2006730732216807496
PubMedGoogle Scholar
