Methods and Modeling

Setting and maintaining standards for patient-reported outcome measures: can we rely on the COSMIN checklists?

Pages 502-511 | Received 27 Nov 2020, Accepted 15 Mar 2021, Published online: 26 Apr 2021

Abstract

As test developers we have often been troubled by published reviews of patient-reported outcome measures (PROMs). Too often, minor issues are judged important, while other reviews exclude the best measures available. Perhaps this led several groups to make recommendations for evaluating the quality of PROMs. The COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist is the latest set of recommendations. While reviewing the COSMIN literature and the reviews conducted using their recommendations, several concerns became apparent. The checklist is not evidence-based, relying instead on the opinion of researchers experienced in health-related quality of life. PROMs measuring other types of outcome are inadequately covered by the checklist. COSMIN chose to focus on Classical Test Theory, and the checklists are not appropriate for use with PROMs developed using modern measurement. Such an approach only obstructs progress in the field of outcome measurement. The retrospective nature of the evaluations also penalizes new PROMs. While the checklists imply that composite, ordinal-level measurement is acceptable, crucial aspects of instrument development and quality are excluded. Reviews based on the COSMIN checklist produce contradictory conclusions and fail to provide evidence to support their recommendations. These problems suggest that the checklists themselves lack reliability and validity. It is also clear that several reviewers lack the expertise to apply the checklists. Researchers require a good grounding in instrument development and psychometrics to produce quality reviews. The science of modern PROM development is still in an early phase. Few available PROMs have sufficient quality, limiting the need for complex reviews. Standards need to be agreed for high-quality outcome measurement; the issue is who should set these standards. Most published reviews merely scratch the surface and lack essential detail.
All reviews of PROMs should be treated with caution, irrespective of whether the COSMIN checklist was employed.

Addendum: COSMIN reviews: the need to consider measurement theory, modern measurement and a prospective rather than retrospective approach to evaluating patient-based measures

Introduction

There are now many thousands of patient-reported outcome measures (PROMs) – most of which were developed for one specific study and never used again. How do researchers and clinicians choose the most appropriate outcome measures for their study? Furthermore, how is it possible to determine whether the instruments are accurate enough to detect meaningful change? Peer-reviewed articles should provide the answers to these questions, but they rarely report the evidence needed to guide decision making. Highly experienced researchers can probably cope with an absence of evidence, but such skills are relatively rare.

The COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklists (described below) could be considered a step in the right direction, but it will be argued that they fall far short of providing the information necessary to judge the quality of PROMs. Unfortunately, there are many ways in which the COSMIN process can easily be frustrated, as there is no guarantee that reviewers will use the checklists as required. This is not a new problem, as attempts to judge quality and accredit outcome measures have failed in several other disciplines. For example, the Standards for Educational and Psychological TestingCitation1 are ambitious guidelines rather than a list of standards. Consequently, they carry no weight and leave users to interpret them for themselves.

The Mental Measurements YearbooksCitation2 are designed to assist professionals in selecting and using standardized tests. Since 1938, the Yearbooks have provided information on, and critical reviews of, the construction, use, and validity of commercially available measures in education and psychology. They provide an independent review of instrument quality, but there is a lack of standardization across reviewers.

There are no accreditation processes for PROMs. The closest are the recommendations from the US FDA that must be met to make quality-of-life claims for pharmaceutical products. However, no evidence is provided to support these recommendations, and the FDA is advised by some of the people who worked on the COSMIN checklist.

This article assesses the content and performance of the COSMIN checklists that are intended to guide systematic reviews of patient-reported outcome measures. These reviews, in turn, are aimed at aiding the selection of the most appropriate PROMs for clinical studies. The article questions whether the checklists can be used to accredit existing questionnaires, as some form of accreditation would clearly be of value. If this were possible, the quality of outcome measurement in medicine could be considerably improved.

Background

The aim of COSMIN was to improve the quality and selection of PROMs by developing and encouraging the use of a transparent methodologyCitation3. The COSMIN team set the following aims for their checklist:

  1. Advance the science and application of health outcome measurement.

  2. Develop new and update existing methodology and practical tools for the selection and use of outcome measurement instruments for research and clinical practice.

  3. Monitor and maintain the scientific quality of COSMIN tools.

  4. Encourage widespread adoption of the COSMIN methodology.

  5. Standardize outcomes and outcome measurement instruments by developing Core Outcome Sets and methodology.

It is difficult to see how a retrospective review system could address points 1 and 2, and points 4 and 5 suggest a desire to promote the COSMIN approach above others. Currently, the main criterion for the choice of a PROM is whether it has been widely used – particularly in clinical trials (irrespective of its quality or suitability for the study). It is difficult to wean perennial users off relatively poor PROMs, and it is certainly not the case that the PROMs used most frequently are the best availableCitation4. This suggests that there is a case for specifying how reviews should be conducted.

The COSMIN initiative was founded in 2005. The original checklist was replaced in 2018 by a risk-of-bias checklist intended to match Cochrane ReviewsCitation5. Bias is interference in the outcomes of research by predetermined ideas, prejudice or influence in a certain direction; both data and data analysts can be biased. Types of bias include selection bias, where the sample is not representative of the population; performance bias, which arises when blinding of participants and personnel fails; detection bias, where outcome assessors are aware of the intervention received; attrition bias, caused by missing data; and reporting bias – the selective reporting of outcomes. Some of these sources of bias are relevant for instrument reviewsCitation6. The COSMIN steering committee decided how the checklist should be adapted to become the COSMIN Risk of Bias checklist. The changes included reordering the measurement properties and determining internal consistency for each scale or subscale separately, with Cronbach’s alpha the preferred statistical method.

This article aims to raise questions about the content of the COSMIN checklists and their performance. It is not intended to produce an alternative technique for conducting structured reviews, but it might encourage the COSMIN group or other researchers to reconsider the methodology they employ. It is not assumed that the task they have addressed is easy – it has involved a great deal of work. However, it is too early to suggest that the COSMIN approach has answered the many questions raised by poor instrument development.
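For readers unfamiliar with the statistic COSMIN prefers, a minimal sketch of Cronbach’s alpha may help; the data below are invented toy scores, and this is the standard textbook formula rather than any COSMIN-supplied code.

```python
import statistics as st

def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items matrix of item scores.

    alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores),
    using population variances throughout.
    """
    k = len(scores[0])
    item_vars = [st.pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = st.pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Toy data: two perfectly correlated items give alpha = 1.0.
toy = [[1, 1], [2, 2], [3, 3]]
alpha = cronbach_alpha(toy)
```

Note that alpha is a Classical Test Theory statistic: a high value indicates inter-item correlation, not unidimensionality – a distinction that matters for the arguments made later in this article.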

Specific issues in the operation of the COSMIN checklist are raised below.

The COSMIN checklist is based on opinion not evidence

The COSMIN group argued that there was no empirical evidence to guide reviews of PROMsCitation7,Citation8. Consequently, the guidelines were based on the experience of the COSMIN steering committeeCitation9. Several other researchers had previously produced methodologies for such reviewsCitation10–14. Some of the COSMIN authors were also involved in these earlier attempts to standardize reviews.

Due to the lack of evidence, a Delphi study was conducted to determine the views of many researchers in the field of health status measurement on how Health-Related Quality of Life (HRQL) outcome measures should be developed and reviewedCitation7. It appears that developers of other types of PROMs were not invited to participate or turned down the invitation. As COSMIN is an intuitive method of test selection, it is subjective and dependent on the expertise and experience of the COSMIN committee members.

There are several types of PROMs available, covering a wide range of outcomes: health status, utility, health-related quality of life and authentic quality of life. Each may need a different development methodology. It needs to be made clear what type of outcome each instrument assesses and how its development methodology differs from that required for other types of outcome.

Creating a checklist that is not based on evidence but on opinions is not an unusual practice in the PROM field. Guidelines for instrument development and adaptation have never been evidence basedCitation15–17. Relying on experience rather than evidence also means that recommendations are based on dated methodologies rather than on looking to the future. It also assumes that the COSMIN researchers already know how reviews should be done.

The following sections discuss important issues related to conducting systematic reviews of PROMs.

Construct theory

A construct theory (conceptual model) shows the underlying structure of the latent construct being measured. It is a method of defining a variable in terms of a limited set of relevant predictor variables (items). The validity of a construct theory reflects the extent to which it predicts variation in item values and person scoresCitation18. Construct theories can be developed from qualitative data to illustrate the areas of impact of a condition and the inter-relationships between them. These models can then be used to develop instruments or to judge the quality of existing PROMs. However, little attention is paid to construct validity when evaluating measures. It appears that a list of symptoms and functional limitations is considered a construct theory by most reviewers.

The perfect measurement of latent constructs requires two indispensable qualities: a coherent construct theory and a specification equation. The specification equation is “a regression model, based on the theory, that forecasts calibrations on the scale”Citation19. Together with a construct theory, this achieves theory-based instrument calibration. Unfortunately, most PROMs contain lists of different types of outcomes associated with a disease. These different outcomes cannot validly be added together to form a coherent measure. This is not a conceptual model, but a list of issues considered important by the PROM developers. Too often the model is assumed to be self-evident, or the individual variables are assumed to be appropriate because they are “all related to” the outcome of interest.

While equations used in the physical sciences represent perfect fundamental measurement, only one measure of a latent variable, the Lexile Framework for Reading, has approached this level of measurement. The Lexile Framework was based on a construct theory that resulted from analyzing several different measures of reading ability. A specification equation was identified and validated that was able to link the construct theory to scores obtained with the Lexile measureCitation19. The specification equation (involving sentence length and the frequency of words used in literature) was able to predict scores on the measure accurately. While a construct theory may start as an idea about why items and people order themselves in a consistent manner, the specification equation provides a means of relating hypotheses to measurement.
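The idea of a specification equation can be sketched as a least-squares regression that predicts item calibrations from a theory-based feature. The numbers below are entirely hypothetical, chosen only to mirror the kind of predictor (sentence length) the Lexile work reports; this is not the actual Lexile equation.

```python
# Hypothetical data: item calibrations (difficulties) and one theory-based
# feature per item (mean sentence length). Calibrations were generated to
# lie exactly on a line, so the fitted equation recovers them perfectly.
sentence_length = [8.0, 12.0, 15.0, 20.0, 25.0]
calibrations    = [0.6, 1.4, 2.0, 3.0, 4.0]   # = 0.2 * length - 1.0

def mean(xs):
    return sum(xs) / len(xs)

# Closed-form simple linear regression: calibration ~ intercept + slope*length.
mx, my = mean(sentence_length), mean(calibrations)
sxy = sum((x - mx) * (y - my) for x, y in zip(sentence_length, calibrations))
sxx = sum((x - mx) ** 2 for x in sentence_length)
slope = sxy / sxx
intercept = my - slope * mx

# A valid specification equation forecasts calibrations from theory alone;
# here, by construction, the predictions match the observed calibrations.
predicted = [intercept + slope * x for x in sentence_length]
```

In real instrument development, the test of the theory is how well such an equation predicts calibrations for items it was not fitted on.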

In contrast to this scientific approach to developing validity, COSMIN suggests that it is only necessary to check with patients and professionals whether the content of a PROM is relevant, comprehensive, and comprehensible. This provides inadequate evidence of content validity. Despite this, the COSMIN authors consider content validity to be the most important aspect of the checklist.

It is crucial for test publishers to describe clearly and thoroughly what the instrument is supposed to measure. Quality of life, functional ability and health status are very different types of outcome; they are developed in different ways and need to be tested in appropriately designed studies.

Importance of measurement theory

Measurement theory describes rules that should be applied in measurement and restrictions on analyses of alternative types of scaling. To assess the quality of PROMs it is necessary to determine whether the analyses undertaken with each measure can be justified. Consequently, it would be expected that a systematic review of PROMs would provide evidence of whether measurement theory has been applied appropriately during instrument development.

Objectivity is a fundamental requirement of measurement. Objective measurement ensures that an amount of the unit measured maintains its size, whichever instrument is used and whoever or whatever is being measuredCitation18. Objectivity requires the creation of unidimensional measurement scales, as the meaning of a score requires the identification of the single variable being measured – yet unidimensionality is rare in PROMs.

StevensCitation20 described the main types of measurement scale: nominal, ordinal, interval and ratio. Wright and LinacreCitation21 stated that observations are always ordinal and that measurements should be interval. Merbitz and colleaguesCitation22 pointed out that something must be done with counts of observed events (ordinal data) to build them into measures that generate interval-level data. Almost all PROMs generate ordinal data. For example, the common practice of using Likert-type response formats does not produce interval scales: such formats are clearly ordered from less to more, but there is no information about the size of the differences between the options.

StevensCitation20 pointed out that means and standard deviations should not be reported for ordinal scales, because these statistics imply knowledge of something more than the relative rank order of the data. For the same reason, parametric analyses should not be employed with ordinal scales. The major problem for ordinal PROMs is that the individual items cannot validly be added to give a total score. Similarly, different scales cannot be added together to give a total score – a common practice in outcome assessment. Unfortunately, ordinal scores are frequently manipulated by instrument developers, mathematically or statistically, as if they produced ratio data. The results of such operations are not valid and can lead to inappropriate conclusionsCitation22.
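The danger of averaging ordinal codes can be shown with a small, wholly invented illustration: an order-preserving recoding of the same Likert categories can reverse a comparison of group means, because the usual 1-2-3 spacing is an arbitrary choice, not a measured quantity.

```python
# Likert responses from two hypothetical groups, categories 1 < 2 < 3.
group_a = [3, 3, 1]
group_b = [2, 2, 2]

def mean(xs):
    return sum(xs) / len(xs)

# Under the usual equal-spacing coding, group A scores higher on average...
naive_a, naive_b = mean(group_a), mean(group_b)   # about 2.33 vs 2.00

# ...but for ordinal data any order-preserving coding is equally defensible.
recode = {1: 0, 2: 5, 3: 6}   # still respects the order 1 < 2 < 3
recoded_a = mean([recode[x] for x in group_a])    # 4.0
recoded_b = mean([recode[x] for x in group_b])    # 5.0
# The comparison of means has reversed, although no response changed.
```

Rank-based comparisons do not suffer from this ambiguity, which is precisely why parametric statistics on raw Likert codes are hard to defend.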

As Wright and StoneCitation23 have explained, measurement requires the use of interval, abstract quantities expressed in linear abstract units, whose meaning does not change along the scale. Outcome measures that fit the Rasch model are unidimensional and produce interval-level data. Moreover, experts in Rasch Measurement Theory (RMT) suggest that where the data fit the model it is possible to construct ratio scalesCitation23. Wright and LinacreCitation21 report that Rasch deduced a mathematical model that specified how to convert observed counts into linear (and ratio) measures. They argue that the location of the scale’s origin is fundamentally arbitrary, set for the convenience of its users; a simple arithmetical operation can then convert an interval scale to a ratio scale and vice versa. Consequently, Rasch’s use of the term “ratio scale” differs from Stevens’sCitation24. RaschCitation25 specifies a scale of ratios, analogous to decibels, which use a logarithmic rather than a linear scale. Merbitz and colleaguesCitation22 concluded that research should be directed toward the development of interval measures rather than spending resources on producing ‘fundamentally flawed ordinal scales’.
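Two of these points can be sketched concretely for the dichotomous Rasch model (a standard formulation, not code from the cited authors): the logit transform converts observed proportions into a linear scale, and the origin of that scale is arbitrary, since shifting every person and item parameter by the same constant changes nothing observable.

```python
import math

def rasch_prob(theta, b):
    """Dichotomous Rasch model: probability of an affirmative response
    for person ability theta and item difficulty b (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def logit(p):
    """Log-odds: converts an observed proportion into a linear measure."""
    return math.log(p / (1.0 - p))

# The origin of the logit scale is arbitrary: shifting person and item
# parameters by the same constant leaves every predicted probability intact.
shift = 3.0
p_original = rasch_prob(1.0, 0.5)
p_shifted  = rasch_prob(1.0 + shift, 0.5 + shift)

# Exponentiating logits gives Rasch's "scale of ratios": equal logit
# differences correspond to equal odds ratios.
odds_ratio = math.exp(logit(0.8) - logit(0.6))
```

This is the sense in which Rasch’s ratio scale resembles the decibel: it is the exponentiated (multiplicative) form of an interval (additive) scale.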

Any review of available PROMs should report whether they produce unidimensional measurement that achieves interval or ratio level data.

Importance of unidimensional measurement

Concrete objects are multidimensional. For example, people can be described in terms of several different variables, including height, nationality, intelligence, or social class. However, it is possible to consider each of these abstract qualities one at a time. For example, it is possible to order people in terms of their quality of life (QoL), which can be measured on a continuum from less to more. All measurement must be unidimensional.

Latent constructs (such as health status or quality of life) cannot be observed or measured directly; they require a set of indicators (e.g. questionnaire items) that together represent the whole variable. Health status may be defined as a set of different variables including pain, emotional distress, physical mobility, anxiety, and so on. Each of these outcomes can be measured on a separate numerical scale, but they cannot be summed to give a total score. HRQL researchers often argue that QoL is multidimensional, and they consider it necessary to measure a range of different relevant outcomes (different symptoms and/or functional limitations). However, Beckie and HaydukCitation26 contradicted this view, considering quality of life to be a global, yet unidimensional, subjective assessment of one's satisfaction with life.

A unidimensional scale is one where each item in a scale measures some aspect of the same construct. Most latent variables are unidimensional, and this quality should be confirmed for all PROMs that are being evaluated or considered for use in a study. Assessment of the monotonicity of items in a measure is crucial to unidimensional measurement. Monotonicity indicates that the proportion of people “passing each step (their ability)” on the rating scale is greater for those with a higher score on the measureCitation27.
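A crude version of this monotonicity check can be sketched as follows: group respondents by total score and confirm that the proportion endorsing an item never decreases as the total rises. This is a toy illustration on invented dichotomous data; proper analyses use rest scores (totals excluding the item itself) and formal tests.

```python
from collections import defaultdict

def item_is_monotonic(responses, item):
    """True if the proportion endorsing `item` (0/1 scored) is
    non-decreasing across groups formed by total score.

    Crude check: the item's own response contributes to the total,
    which proper rest-score analyses would exclude."""
    groups = defaultdict(list)
    for row in responses:
        groups[sum(row)].append(row[item])
    proportions = [sum(vals) / len(vals) for _, vals in sorted(groups.items())]
    return all(b >= a for a, b in zip(proportions, proportions[1:]))

# Toy data where every item behaves monotonically (a Guttman-like pattern)...
well_behaved = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
# ...and toy data where item 0 is endorsed less often by higher scorers.
misfitting = [[1, 0, 0], [1, 0, 0], [0, 1, 1]]
```

An item that fails such a check is evidence against unidimensionality: it does not order people in the same way as the rest of the scale.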

Where HRQL measures are used that cover a range of different outcomes, it is essential that each of the separate outcomes are measured with a unidimensional scale. What is not valid is then to add together the different outcomes and treat them as a multidimensional scale. Instead, a profile of the different outcomes measured should be presented.

Where scales are not unidimensional it is likely that wrong conclusions will be drawn about the nature of the latent trait being measured. Unidimensionality should be confirmed before valid conclusions can be drawn from the dataCitation28. Modern measurement can ensure that measures are unidimensional.

The problem with composite measures

A composite measure is one that is made up of two or more variables that are related to one another conceptually or statistically. Such measurement is common in medicine and economics, but rare in other disciplinesCitation29. Where the developer of a PROM does not have a clear conceptual model for its content and purpose, it is likely that it will assess more than one construct. All the items in a PROM should measure the same construct and evidence is required that this is the case.

There is little value in reviewing composite PROMs; they continue to be used widely but rarely produce valid information. Why, then, do reviewers still evaluate PROMs with multidimensional outcomes? Even if an attempt is made to show that each separate domain is unidimensional, which ones are most important? The same question applies to the COSMIN checklist itself. It is a composite, multidimensional measure in which the individual judgements cannot validly be added to give an overall performance. The COSMIN authors state that 75% of the checklist needs to be ‘passed’ by each PROM. However, given that the COSMIN checklist generates a composite score that measures at the ordinal level at best, the 75% pass rate is likely to be misleading: it would be possible for a measure to pass the least important requirements and fail the most crucial.
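The point about the pass rate can be made with simple arithmetic on a hypothetical checklist. The criteria, their importance and the pass/fail judgements below are all invented for illustration; they are not the actual COSMIN items.

```python
# Hypothetical eight-criterion checklist: two criteria treated as crucial,
# six as minor. All judgements are invented for the sake of the example.
checklist = {
    "content validity":        {"crucial": True,  "passed": False},
    "unidimensionality":       {"crucial": True,  "passed": False},
    "translation procedure":   {"crucial": False, "passed": True},
    "sample size":             {"crucial": False, "passed": True},
    "missing-data handling":   {"crucial": False, "passed": True},
    "reporting completeness":  {"crucial": False, "passed": True},
    "floor/ceiling reporting": {"crucial": False, "passed": True},
    "test-retest interval":    {"crucial": False, "passed": True},
}

pass_rate = sum(c["passed"] for c in checklist.values()) / len(checklist)
crucial_failures = [name for name, c in checklist.items()
                    if c["crucial"] and not c["passed"]]
# The instrument reaches the 75% threshold even though every crucial
# criterion has been failed - the hazard of any unweighted composite score.
```

Any fixed pass rate over an unweighted composite has this property; only a scheme that weights or gates on the crucial criteria avoids it.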

The importance of modern measurement

As was argued above, unidimensionality is an essential requirement of the measurement of latent traits. This explains why Classical Test Theory (CTT) is gradually being replaced by Item Response Theory (IRT) and RMT. The value of modern measurement is that, in addition to producing unidimensional measurement, it confirms internal construct validityCitation30: items that misfit the measurement model are identified and can be removed. RMT has the added advantage of converting ordinal-level measurement into interval scaling, satisfying one of the three main requirements for fundamental measurement. These are:

  • the numerical properties of order (one mark on the ruler represents more or less of the construct than another);

  • addition (points on rulers may be added together) - this requires interval level data;

  • specific objectivity (the calibration of the ruler (item set or questions) is independent of the persons used to calibrate it and vice versa).

Where data fit the Rasch model these properties are confirmed and fundamental measurement followsCitation31.

RMT is an important step in moving towards the quality of measurement achieved in the physical sciences. It allows construction of abstract linear measures of high quality. It enables the development of PROMs and ensures quality control. The Rasch model produces specific objectivity (which has great advantages such as using the raw sum score as a sufficient statistic). RMT also allows the creation of item banks and enhances comparisons of performance in longitudinal studiesCitation23,Citation32.
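The claim that the raw sum score is a sufficient statistic can be checked directly for two dichotomous items: under the Rasch model, the probability of a particular response pattern given the total score does not depend on the person's ability. This is a standard textbook demonstration, sketched here with arbitrary difficulty values.

```python
import math

def p_correct(theta, b):
    """Dichotomous Rasch probability of an affirmative response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def prob_pattern_10_given_total_1(theta, b1, b2):
    """P(item 1 affirmed, item 2 not | exactly one item affirmed)."""
    p1, p2 = p_correct(theta, b1), p_correct(theta, b2)
    p10 = p1 * (1 - p2)
    p01 = (1 - p1) * p2
    return p10 / (p10 + p01)

# Under the Rasch model this conditional probability is identical for every
# ability level: once the total score is known, the response pattern carries
# no further information about theta. That is sum-score sufficiency.
low_ability  = prob_pattern_10_given_total_1(-1.5, 0.0, 1.0)
high_ability = prob_pattern_10_given_total_1( 2.0, 0.0, 1.0)
```

Algebraically, the theta terms cancel in the conditional odds, leaving exp(b2 - b1): the comparison of items is independent of the persons used, which is what specific objectivity means.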

In the original COSMIN checklist, it was suggested that only limited information is required to justify conclusions drawn about the scaling properties of instruments that reported using modern measurement modelsCitation7. Coverage of RMT and IRT was further downplayed in the development of the risk-of-bias checklistCitation5. The arguments for these changes are unclear. They seem to arise from Cochrane’s view that the quality of studies should be considered when judging the results reported. Certain IRT requirements were removed: the IRT model, the computer software package used and the method of estimation were no longer considered relevant to the quality of the study. However, they are crucial for judging whether the methodology used was appropriate and whether the results reported are accurate. The requirement to report whether unidimensionality and local independence had been checked was also removed; this information, too, is essential for judging the quality of a study. To judge the quality of RMT analyses and interpret their meaning, detailed information needs to be provided about fit residuals, response thresholds, differential item functioning, item dependency, and person-item distributions. Reviewers should report whether these findings are included in instrument development publications.

Focus on classical test theory methods

The COSMIN checklist has a strong focus on CTT, perhaps because most PROMs use this methodology for test development and validation. It is also likely that both writers and readers of articles can understand CTT analyses, while they are unfamiliar with modern measurement. Even where authors claim to have used RMT, they generally fail to present adequate information to allow judgements to be made about a PROM’s unidimensionality and, hence, its qualityCitation33,Citation34.

RMT and IRT have been available since the 1960s with the Rasch model being more strictly based on the definition of measurement while IRT models adopt a statistical approach. Despite this, it is still relatively rare to see instruments developed using RMT or IRT.

This continued reliance on CTT methods is surprising, as researchers have been advocating the benefits of RMT for PROM development for many years, and there are many examples of those benefitsCitation31,Citation35. Many researchers select IRT or RMT because of the higher standards of measurement they provide; these approaches also offer options for test equating, computer-adaptive testing and test-score interpretation.

Siemons and colleaguesCitation36 concluded that COSMIN ratings are only suitable for analyzing the quality of instruments developed using CTT. However, COSMIN accept the use of factor analysis to develop “unidimensional” scales. Like Confirmatory Factor Analysis (CFA), Rasch analysis is a confirmatory approach to examining whether items belong to the scale under investigation. However, there are known limitations to using factor analysis on ordinal scales, including its parametric basis.

Rusch et al.Citation37 report on the most important limitations of factor analysis:

  • it assumes a linear relationship between the latent variable and observed scores, which is rarely the case;

  • the true score either cannot be estimated directly or can be estimated only by making assumptions that are difficult to support;

  • parameters such as reliability, discrimination, location, or factor loadings depend on the sample being used.

The article also argues that IRT offers several advantages:

  • it assumes nonlinear relationships;

  • it allows more appropriate estimation of the true score;

  • it can estimate item parameters independently of the sample being used;

  • it allows the researcher to select items that are in accordance with a desired model; and

  • it applies and generalizes concepts such as reliability and internal validity, and thus allows researchers to derive more information about the measurement process.

CFA can be used in combination with RMT but is unlikely to be effective on its own. When there is a need to confirm the suitability of items, to check for dimensionality and to look at evidence of local dependence, differential item functioning and item fit (all of which are important in instrument development), factor analysis is inadequateCitation30. However, where it is required to reduce a large set of items to a small number of summary scale scores, both approaches are neededCitation38.

Focus on health-related quality of life (HRQL) measures

HRQL PROMs can be useful in providing data that clinicians will consider relevant. However, this construct offers little to patients (the true experts). It is well accepted that patients should be involved in their own treatment. Surely, they should also be involved in deciding whether interventions improve their lives. This can be done by selecting PROMs more appropriate for that purpose than measures of health statusCitation39.

The content of the COSMIN checklist is heavily biased towards HRQL. The authors state that they are concerned with health status measurement. This decision omits a wide range of other types of PROMs. Measures of quality of life, health behavior, satisfaction, patient-reported experience and preference-based instruments are all excluded. Omission of these types of PROMs means that users of the COSMIN checklist are directed to health status measures and kept away from other types of outcome that may often be more suitable for the end-users. Reviewers are often unaware of the different types of outcome measures they are reviewing. It is inappropriate to treat all PROMs as HRQL measures. A checklist focused on HRQL is unsuitable for evaluating other types of outcomes.

This limited coverage of the full range of PROMs could result in researchers from one area of PROM measurement imposing their values on other groups of researchers (one of the original aims of the COSMIN group). The large number of articles describing COSMIN and its checklists in very similar ways could have the same effect. There is still a long way to go in understanding PROMs and identifying the best development processes, and it is not appropriate to act as the arbiter of how things should be done in the absence of evidence.

Performance of the COSMIN checklist

A checklist based on opinion is not open to evaluation; consequently, there is limited justification for its use. The checklist contains few guidelines on how to judge the quality of the PROMs reviewed – it is up to reviewers to decide whether the articles reviewed imply good or bad results. In this respect, it allows reviewers to continue in the way they always have. The difference now is that they can say they used COSMIN guidelines to “validate” their work. The idea that application of the COSMIN guidelines would validate reviews is misplaced; validity is constantly evolving on the basis of new evidence. While clinicians and others may say that an impartial third party has concluded that an instrument is good, there is no guarantee that the third party is impartial. This is particularly problematic where reviewers fail to distinguish between the different types of PROMs.

Reviews that claim to be based on the COSMIN checklist vary considerably in quality, and there is a general lack of consistency: PROMs recognized for their poor performance are frequently rated good by some reviewers and poor by others. Unfortunately, the results of COSMIN-based reviews are generally not supported by evidence presented by the reviewers. The checklist is mostly used to review ordinal, composite PROMs that are, by definition, of poor quality.

SzekeresCitation40 made some positive remarks about the COSMIN checklist. However, he concluded that although his review produced a low overall quality score, DASH translations may still be some of the more valid, responsive, and clinically useful PROMs available. This appears to raise questions about the review process. SzekeresCitation40 also pointed out that the “content validity of the COSMIN checklist has only been established by the developers and needs to be evaluated by impartial experts”. He also reported that the inter-rater reliability of the COSMIN checklist is low. Percentage agreement of ratings was below 80% for one third of the recommendationsCitation41.

It is interesting to look at the outcomes of reviews based on the COSMIN checklist. For this purpose, a brief review was conducted of two widely used generic PROMs: the SF-36 and its variants, and the different EQ-5D versions. These measures were selected as they are frequently reported to have poor psychometric propertiesCitation42–45. It would be expected that reviews using the same methodology would produce similar results; this was not found to be the case.

The SF-36 was judged adequate in two reviewsCitation46,Citation47. In contrast, Eyles et al.Citation48 found the SF-36 unhelpful and Janssen and colleaguesCitation49 reported that only one of the eight SF-36 sections was acceptable. Two studies reported the EQ-5D to be adequateCitation50,Citation51. A further two found it to be inadequateCitation52,Citation53. Craxford and colleaguesCitation54 judged both of these generic PROMs to be inadequate.

A study by Poku et al.Citation55 was interesting insofar as it found both the EQ-5D and SF-36 inadequate and recommended the use of the Nottingham Health Profile (NHP). As developers of the NHP, we took the decision several years ago to stop recommending its use because of its inadequate psychometric properties.

Overall, this short review gives the impression that certain reviewers who used the COSMIN checklist are judging poor quality PROMs as adequate or even good. This is despite the PROMs’ inability to show differences between interventions or changes over time. These contradictions cannot be explained by differences in the populations studied, as both instruments measure generic health status. Such findings raise questions about the reliability and validity of the COSMIN checklist. Surprisingly, in a response to the criticisms of AngstCitation56, the COSMIN authors stated that ‘the COSMIN panel has set high standards’.

To judge the quality of the COSMIN-based reviewing process, the article by Chen and colleaguesCitation57 is informative. The review was intended to assess the quality of PROMs used to study inflammatory bowel disease. The Inflammatory Bowel Disease Questionnaire (IBDQ)Citation58 is the most widely used PROM in gastroenterology, largely as a result of its age – it was developed in the 1980s. A newly developed PROM, the Crohn’s Life Impact Questionnaire (CLIQ)Citation59, was also included in the review.

Little information about the CLIQ was presented, and what was included was inaccurate. The content of the questionnaire was not developed from a literature review, as stated in the review; it was derived entirely from qualitative patient interviews. The measure also consists of a single, unidimensional scale, not two different domains as reported. The reviewers incorrectly stated that the CLIQ assesses emotional functioning, when the measure assesses needs-based quality of life. This is possibly because the reviewers were limited by the narrow coverage of the checklist. The review did mention that the CLIQ is unidimensional, the only reference to this quality in the whole review, even though the CLIQ was the only unidimensional measure reviewed.

There were also problems with the review of the IBDQ. The reviewers reported that this measure contained three main domains: symptom, social and emotional. In fact, the questionnaire consists of four domains, yet the measure is usually treated as a composite index with a single score. The domain structure confirms that the measure is multidimensional, making the summing of domain scores in this way invalid.

The IBDQ was judged to have good measurement properties by the reviewers. Yet its internal consistency was marginal (0.7) considering the length of the questionnaire. More worryingly, its reproducibility was also 0.7Citation60. This would be useful information for readers, as it means that the smallest detectable difference for the IBDQ is around 60 points. Consequently, the IBDQ score must change by at least 60 points (over 30% of its measurement range) before it is certain that the change has not happened by chance. Trial findings show improvements in IBDQ score for active treatment of between 10 and 20 pointsCitation61–63. Such information shows clearly that the IBDQ is not sufficiently responsive to detect true changes in health status.
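The 60-point figure can be reconstructed from the standard formula for the smallest detectable difference. The following sketch is illustrative only: the test–retest reliability (0.7) and the IBDQ score range (32–224) come from the text and the cited literature, while the between-patient standard deviation of 40 points is an assumed value chosen to reproduce the quoted figure.

```python
import math

# Illustrative reconstruction of the smallest detectable difference (SDD).
# The reliability of 0.7 is reported in the text; the between-patient SD
# of 40 points is an ASSUMED plausible value for the IBDQ (total score
# range 32-224) chosen to reproduce the ~60-point figure quoted.
icc = 0.7                  # reported test-retest reliability
sd = 40.0                  # assumed between-patient standard deviation
score_range = 224 - 32     # IBDQ measurement range (192 points)

sem = sd * math.sqrt(1 - icc)      # standard error of measurement
sdd = 1.96 * math.sqrt(2) * sem    # 95% smallest detectable difference

print(f"SEM = {sem:.1f} points")                          # ~21.9
print(f"SDD = {sdd:.1f} points")                          # ~60.7
print(f"SDD = {100 * sdd / score_range:.0f}% of range")   # ~32%
```

On these assumptions, an individual patient’s score must change by roughly 60 points, about a third of the measurement range, before the change exceeds measurement error at the 95% level – far larger than the 10–20 point treatment effects reported in trials.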

The CLIQ scored consistently higher than the IBDQ on the COSMIN criteria reported. However, the reviewers concluded that the IBDQ had the strongest evidence of reliability, validity and responsiveness for adult IBD patientsCitation57. No mention was made of the CLIQ in the results section. The authors concluded that the review ‘better guides the use of IBD-specific HRQL instruments and helps clinicians and researchers choose appropriate IBD instruments’! Perhaps this explains why older measures are so often judged the best in reviews, irrespective of their true quality.

In a better researched review of IBD measures, it was concluded that most of the available IBD-specific PROMs lack both a clear definition of the construct of interest and patient involvement in the development and evaluation of their qualityCitation64. The authors recommended that future research focus on defining the constructs of interest for IBD populations and on qualitative studies with IBD patients to design new instruments. They did not make it clear that this was precisely the method used to develop the CLIQ.

Two COSMIN-based reviews of foot- and ankle-specific questionnaires were found. Sierevelt et al.Citation65 only judged the properties of the three most widely used scales, suggesting that they believe frequency of use is an important criterion of quality. The authors reported that the qualities of the measures they reviewed were fair to poor. Despite this, they concluded that the Foot and Ankle Outcome Score (FAOS) and the Foot and Ankle Ability Measure (FAAM) are promising outcome measures. However, they also warned that the shortcomings of these measures should be considered when interpreting results in clinical settings or trials. The FAOS contains items that refer to impairments, disabilities, participation problems and QoL, while the FAAM is limited to documenting disabilities. As the two instruments have different content and purposes, it is questionable how they can be considered equivalent in value.

Jia et al.Citation66 reviewed 50 foot- and ankle-specific instruments. They reported that most of these PROMs had limited evidence of their psychometric properties. They did not rate the FAOS or FAAM very highly and concluded that the Manchester-Oxford Foot Questionnaire (MOXFQ) had the highest overall ratings and could be a useful PROM for evaluating patients with foot or ankle diseases. Surprisingly, the authors reported that the studies reviewed were of poor quality and that no evidence of content or criterion validity could be found. It should also be noted that the MOXFQ has three domains, yet these are added together to give a single score converted to a scale from 0 to 100Citation67.

Interestingly, a further review of these PROMs has been publishedCitation68. This review was conducted prior to the introduction of the COSMIN checklist. The authors concluded that the Foot and Ankle Disability Index (FADI) and the FAAM were the most appropriate.

All three reviews omitted important information about how the measures were developed and about the item-reduction process. While reviews using the same methodology would be expected to produce the same results, this is clearly not the case. Rather than simply accepting what authors say, reviewers must evaluate the reported data themselves. This would be very time consuming but would be expected to improve the quality of reviews.

This short appraisal of the performance of the COSMIN checklists suggests that they do not produce consistent or valid reviews. Consequently, it questions whether confidence can be placed in published reviews that claim to apply the COSMIN checklists.

Who conducts the reviews?

A problem for the COSMIN committee is that many of the reviewers claiming to apply their checklist are not experts in measure development or evaluation. They may well be experts in clinical aspects of the disease studied, but this is very different. Most PROMs have been developed by people who are clinically trained and have a particular interest in PROMs that assess health status, a construct very different from whether patients benefit from interventions.

Many of the problems identified by assessing COSMIN-based reviews are the result of unqualified reviewers trying to apply the checklists. Perhaps COSMIN should specify which reviews meet their criteria for quality.

Another question is why reviews are conducted. Too often it appears that test developers conduct a review to prove the need for a new measure; if they found that an existing measure was of high quality, there would be no need for a new one. But there are several legitimate reasons for conducting a review. These include matching the construct covered by the PROM to the specific outcomes required by a study, determining whether the PROM has sufficient quality and responsiveness to detect changes in the construct measured, and assessing its acceptability to respondents. This diversity of needs can be a problem, as it could confuse readers.

Discussion and conclusions

The results of PROMs used in clinical trials are rarely reported. This is because they are largely unable to detect change and are poorly understood by researchers. At present, the selection of PROMs is largely a lottery with little scientific basis. Too often the same PROMs are selected because they have been used in previous studies with similar populations, despite their unsuitability for the purpose.

Surprisingly, there is little evaluation of the COSMIN checklists in the literature. Similarly, it is difficult to find articles criticizing their approach. It is hoped that this article may initiate a discussion about the evaluation and selection of PROMs.

The COSMIN checklists have not improved the situation since their introduction. It is possible that reviewers find the COSMIN procedures too complex to follow: five COSMIN manuals were found in the literature, totaling 275 pages, and much of this information is duplicated in journal articles.

The development of the COSMIN guidelines for conducting reviews of PROMs represented a great deal of work. However, the checklists developed are not evidence-based, relying instead on the opinions of several people with some experience of PROM development or use. Unfortunately, the COSMIN focus is on PROMs assessing health status. A wide range of other types of PROM are omitted, presumably because their developers were excluded from the COSMIN checklist development process. A consequence is that the COSMIN authors risk suggesting to readers that other types of PROM are inappropriate. They also fail to encourage the use of modern methodologies.

The assessments included in the checklist omit crucial indicators of quality. Construct theories, fundamental measurement, unidimensionality, internal validity, item generation and reduction are all largely missing from the requirements or are inadequately described. Adoption of the COSMIN checklist would hold back the development of the science of outcome measurement, as reviewers would be applying retrospective standards and excluding a wide range of PROM types.

A particular problem for COSMIN (or any other review procedure) is that it relies on the ability of the reviewers to evaluate the information in publications – where the article authors make claims about the quality of their PROMs. Too often reviews are made by people who have limited experience or expertise in PROM development and assessment. The evaluation of reviews using the COSMIN checklists confirms that to conduct an adequate review it is necessary to have a good grounding in measurement and psychometrics.

This article makes it clear that the COSMIN checklists have several problems that need to be resolved before they could become a useful standard for test developers and for people looking for the most appropriate instrument for their purposes.

Too little attention is paid to the theoretical basis of measures. A list of issues addressed by the questionnaire does not represent a model. Why was the measure developed? What was it designed to measure? To what extent are differences in scores explicable in terms of the conceptual model?

There is a need to consider the implications of measurement theory for the design of PROMs. Does the PROM provide objective measurement? Does it produce ordinal or truly interval measurement? Are appropriate statistical tests employed when testing validity?

If a measure has not been developed using Rasch Measurement Theory (RMT), how can it be shown that it measures at the interval level? Why is classical test theory still used to develop new scales? RMT and item response theory (IRT) produce much more valid and powerful scales. Where the data fit the RMT model, unidimensionality is achieved. It is time to bite the bullet and move from dated measurement models to modern ones, especially RMT.

At present, reviewers accept the information and interpretation provided by test developers. This is not surprising when there are no real standards accepted in the literature. To avoid confusing readers, reviews should only cover measures that have the same purpose. Thus, utility, health status, satisfaction and quality of life measures should be reported in separate articles or in separate sections of a review. This would prevent the wrong kind of PROM being selected for studies.

Few PROM development articles provide the information necessary for assessing their quality. Such PROMs should be excluded from reviews. This would not be a problem, as far too many scales of limited quality are developed. It could be argued that excluding such PROMs would be too restrictive. However, few PROMs have been effective in clinical studies and trials. Their use is unfair to patients, as their views and voice are not being heard.

Difficult questions include who should set standards, and who should make judgments about instrument quality. Should a body such as COSMIN, which has limited skills available to it, decide? Should the International Society for Quality of Life Research or the International Society for Pharmacoeconomics and Outcomes Research decide? Is it the responsibility of governments, through organizations such as the National Institute for Health and Care Excellence or the Food and Drug Administration, to state what is important? We know that all these approaches are largely ineffective and (often correctly) ignored by test developers.

As modern measurement techniques are only gradually being introduced to PROM development, it may be too early to prescribe how measures should be evaluated. At present, it is advisable to take personal responsibility for judging the quality of PROMs or to rely on trusted psychometricians who are aware of modern measurement requirements.

Transparency

Declaration of funding

There is no sponsorship/funding to declare.

Declaration of financial/other relationships

The authors are employees of Galen Research Ltd., which develops patient-reported outcome measures.

JME peer reviewers on this manuscript have no relevant financial or other relationships to disclose.

Author contributions

SPM contributed to the conception of the work, performed the literature search and produced the first draft of the manuscript. AH contributed to interpretation of the relevant literature and critically revised the work. The final version was read and approved by both authors.

Acknowledgements

No assistance in the preparation of this article is to be declared.

References

  • Hambleton R, Arrasmith G, Sheehan D, et al. Standards for educational and psychological testing: six reviews. J Educ Meas. 1986;23:83–98.
  • Buros OK (editor). The nineteen thirty-eight mental measurements yearbook. New Brunswick (NJ): Rutgers University Press; 1938.
  • Mokkink LB, Prinsen CA, Bouter LM, et al. The COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) and how to select an outcome measurement instrument. Braz J Phys Ther. 2016;20(2):105–113.
  • Hendrikx J, de Jonge MJ, Fransen J, et al. Systematic review of patient-reported outcome measures (PROMs) for assessing disease activity in rheumatoid arthritis. RMD Open. 2016;2(2):e000202.
  • Mokkink LB, De Vet HC, Prinsen CA, et al. COSMIN risk of bias checklist for systematic reviews of patient-reported outcome measures. Qual Life Res. 2018;27(5):1171–1179.
  • Higgins JPT, Green S, (editors). Cochrane handbook for systematic reviews of interventions. Version 5.1.0 (updated March 2011). Cochrane. Chichester: Wiley; 2011. Available from https://training.cochrane.org/handbook/archive/v5.1/
  • Mokkink LB, Terwee CB, Patrick DL, et al. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res. 2010;19(4):539–549.
  • Mokkink LB, Terwee CB, Knol DL, et al. The COSMIN checklist for evaluating the methodological quality of studies on measurement properties: a clarification of its content. BMC Med Res Methodol. 2010;10:1–8.
  • Prinsen CA, Mokkink LB, Bouter LM, et al. COSMIN guideline for systematic reviews of patient-reported outcome measures. Qual Life Res. 2018;27(5):1147–1157.
  • Kirshner B, Guyatt G. A methodological framework for assessing health indices. J Chronic Dis. 1985;38(1):27–36.
  • Bombardier C, Tugwell P. Methodological considerations in functional assessment. J Rheumatol. 1987;14(Suppl 15):6–10.
  • Streiner DL. A checklist for evaluating the usefulness of rating scales. Can J Psychiatry. 1993;38(2):140–148.
  • Alrubaiy L, Hutchings HA, Williams JG. Assessing patient reported outcome measures: a practical guide for gastroenterologists. United European Gastroenterol J. 2014;2(6):463–470.
  • Van Zile-Tamsen C. Using Rasch analysis to inform rating scale development. Res High Educ. 2017;58(8):922–933.
  • Committee for Medicinal Products for Human Use. Reflection paper on the regulatory guidance for the use of health-related quality of life (HRQL) measures in the evaluation of medicinal products. London (UK): European Medicines Agency; 2005. (EMEA/CHMP/EWP/139391/2004).
  • U.S. Department of Health and Human Services, FDA Center for Drug Evaluation and Research, Center for Biologics Evaluation and Research, and Center for Devices and Radiological Health. Guidance for industry: patient-reported outcome measures: use in medical product development to support labeling claims: draft guidance. Health Qual Life Outcomes. 2006;4:79.
  • Eremenco S, Pease S, Mann S, et al. Patient-reported outcome (PRO) consortium translation process: consensus development of updated best practices. J Patient Rep Outcomes. 2018;2:12.
  • McKenna SP, Heaney A, Wilburn J, et al. Measurement of patient-reported outcomes. 1: The search for the Holy Grail. J Med Econ. 2019;22(6):516–522.
  • Stenner AJ. Measuring reading comprehension with the lexile framework. Durham (NC): MetaMetrics, Inc; 1996.
  • Stevens SS. On the theory of scales of measurement. Science. 1946;103(2684):677–680.
  • Wright BD, Linacre JM. Observations are always ordinal; measurements, however, must be interval. Arch Phys Med Rehabil. 1989;70(12):857–860.
  • Merbitz C, Morris J, Grip JC. Ordinal scales and foundations of misinference. Arch Phys Med Rehabil. 1989;70(4):308–312.
  • Wright BD, Stone MH. Best test design. Chicago (IL): MESA Press; 1979.
  • Koch W, Schulz EM, Wright R, et al. What is a ratio scale? Rasch Measurement Transactions. 1996;9(4):457.
  • Rasch G. Probabilistic models for some intelligence and attainment tests. Chicago (IL): MESA Press; 1992.
  • Beckie TM, Hayduk LA. Measuring quality of life. Soc Indic Res. 1997;42(1):21–39.
  • Velozo CA, Seel RT, Magasi S, et al. Improving measurement methods in rehabilitation: core concepts and recommendations for scale development. Arch Phys Med Rehabil. 2012;93(8 Suppl):S154–S163.
  • Heene M, Kyngdon A, Sckopke P. Detecting violations of unidimensionality by order-restricted inference methods. Front Appl Math Stat. 2016;2:3.
  • Ley P. Quantitative aspects of psychological assessment: an introduction. Oxford (UK): Duckworth & Co; 1972.
  • Kersten P, Vandal AC, Elder H, et al. Strengths and Difficulties Questionnaire: internal validity and reliability for New Zealand preschoolers. BMJ Open. 2018;8(4):e021551.
  • Tennant A, McKenna SP, Hagell P. Application of Rasch analysis in the development and application of quality of life instruments. Value Health. 2004;7:S22–S26.
  • Planinic M, Boone WJ, Susac A, et al. Rasch analysis in physics education research: Why measurement matters. Phys Rev Phys Edu Res. 2019;15(2):020111.
  • Hocaoglu MB, Gaffan EA, Ho AK. The Huntington's Disease health-related quality of life questionnaire (HDQoL): a disease-specific measure of health-related quality of life. Clin Genet. 2012;81(2):117–122.
  • Yorke J, Corris P, Gaine S, et al. emPHasis-10: development of a health-related quality of life measure in pulmonary hypertension. Eur Respir J. 2014;43(4):1106–1113.
  • Tennant A, Conaghan PG. The Rasch measurement model in rheumatology: what is it and why use it? When should it be applied, and what should one look for in a Rasch paper? Arthritis Rheum. 2007;57(8):1358–1362.
  • Siemons L, ten Klooster PM, Taal E, et al. Modern psychometrics applied in rheumatology-a systematic review. BMC Musculoskelet Disord. 2012;13:216.
  • Rusch T, Lowry PB, Mair P, et al. Breaking free from the limitations of classical test theory: Developing and measuring information systems scales using item response theory. Inf Manag. 2017;54(2):189–203.
  • Christensen KB, Engelhard G, Salzberger JT. Ask the experts: Rasch vs. factor analysis. Rasch Measurement Transactions. 2012;26(3):1373–1378.
  • McKenna SP, Wilburn J. Patient value: its nature, measurement, and role in real world evidence studies and outcomes-based reimbursement. J Med Econ. 2018;21(5):474–480.
  • Szekeres M. Clinical relevance commentary in response to: The validity and clinical utility of the Disabilities of the Arm Shoulder and Hand questionnaire for hand injuries in developing country contexts: A systematic review. J Hand Ther. 2018;31(1):91–92.
  • Mokkink LB, Terwee CB, Gibbons E, et al. Inter-rater agreement and reliability of the COSMIN (COnsensus-based Standards for the selection of health status Measurement Instruments) checklist. BMC Med Res Methodol. 2010;10:82.
  • Woodcock AJ, Julious SA, Kinmonth AL, Diabetes Care From Diagnosis Group, et al. Problems with the performance of the SF-36 among people with type 2 diabetes in general practice. Qual Life Res. 2001;10(8):661–670.
  • Mallinson S. The Short-Form 36 and older people: some problems encountered when using postal administration. J Epidemiol Community Health. 1998;52(5):324–328.
  • Velanovich V. Behavior and analysis of 36-item Short-Form Health Survey data for surgical quality-of-life research. Arch Surg. 2007;142(5):473–478.
  • McKenna SP, Heaney A, Wilburn J. Measurement of patient-reported outcomes. 2: Are current measures failing us? J Med Econ. 2019;22(6):523–530.
  • Treanor C, Donnelly M. A methodological review of the Short Form Health Survey 36 (SF-36) and its derivatives among breast cancer survivors. Qual Life Res. 2015;24(2):339–362.
  • Ertzgaard P, Nene A, Kiekens C, et al. A review and evaluation of patient-reported outcome measures for spasticity in persons with spinal cord damage: Recommendations from the Ability Network–an international initiative. J Spinal Cord Med. 2020;43:813–823.
  • Eyles JP, Hunter DJ, Meneses SR, et al. Instruments assessing attitudes toward or capability regarding self-management of osteoarthritis: a systematic review of measurement properties. Osteoarthritis Cartilage. 2017;25(8):1210–1222.
  • Janssen CA, Oude Voshaar MAH, Ten Klooster PM, et al. A systematic literature review of patient-reported outcome measures used in gout: an evaluation of their content and measurement properties. Health Qual Life Outcomes. 2019;17(1):63.
  • Marti C, Hensler S, Herren DB, et al. Measurement properties of the EuroQoL EQ-5D-5L to assess quality of life in patients undergoing carpal tunnel release. J Hand Surg Eur Vol. 2016;41(9):957–962.
  • Qian X, Tan RL, Chuang LH, et al. Measurement properties of commonly used generic preference-based measures in east and south-east Asia: A systematic review. Pharmacoeconomics. 2020;38(2):159–170.
  • Whynes DK, McCahon RA, Ravenscroft A, et al. Responsiveness of the EQ-5D health-related quality-of-life instrument in assessing low back pain. Value Health. 2013;16(1):124–132.
  • Mason SJ, Catto JW, Downing A, et al. Evaluating patient-reported outcome measures (PROMs) for bladder cancer: a systematic review using the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist. BJU Int. 2018;122(5):760–773.
  • Craxford S, Deacon C, Myint Y, et al. Assessing outcome measures used after rib fracture: A COSMIN systematic review. Injury. 2019;50(11):1816–1825.
  • Poku E, Aber A, Phillips P, et al. Systematic review assessing the measurement properties of patient-reported outcomes for venous leg ulcers. BJS Open. 2017;1(5):138–147.
  • Angst F. The new COSMIN guidelines confront traditional concepts of responsiveness. BMC Med Res Methodol. 2011;11:152.
  • Chen XL, Zhong LH, Wen Y, et al. Inflammatory bowel disease-specific health-related quality of life instruments: a systematic review of measurement properties. Health Qual Life Outcomes. 2017;15(1):177.
  • Guyatt G, Mitchell A, Irvine EJ, et al. A new measure of health status for clinical trials in inflammatory bowel disease. Gastroenterology. 1989;96(3):804–810.
  • Wilburn J, McKenna SP, Twiss J, et al. Assessing quality of life in Crohn’s disease: development and validation of the Crohn’s Life Impact Questionnaire (CLIQ). Qual Life Res. 2015;24(9):2279–2288.
  • Irvine EJ. Development and subsequent refinement of the inflammatory bowel disease questionnaire: a quality-of-life instrument for adult patients with inflammatory bowel disease. J Pediatr Gastroenterol Nutr. 1999;28(4):S23–S27.
  • Boye B, Lundin KE, Jantschek G, et al. INSPIRE study: does stress management improve the course of inflammatory bowel disease and disease-specific quality of life in distressed patients with ulcerative colitis or Crohn's disease? A randomized controlled trial. Inflamm Bowel Dis. 2011;17(9):1863–1873.
  • Cross RK, Cheevers N, Rustgi A, et al. Randomized, controlled trial of home telemanagement in patients with ulcerative colitis (UC HAT). Inflamm Bowel Dis. 2012;18(6):1018–1025.
  • Panés J, Su C, Bushmakin AG, et al. Randomized trial of tofacitinib in active ulcerative colitis: analysis of efficacy based on patient-reported outcomes. BMC Gastroenterol. 2015;15:14.
  • van Andel EM, Koopmann BD, Crouwel F, et al. Systematic review of development and content validity of patient-reported outcome measures in Inflammatory Bowel Disease: do we measure what we measure? J Crohns Colitis. 2020;14(9):1299–1315.
  • Sierevelt IN, Zwiers R, Schats W, et al. Measurement properties of the most commonly used Foot- and Ankle-Specific Questionnaires: the FFI, FAOS and FAAM. A systematic review. Knee Surg Sports Traumatol Arthrosc. 2018;26(7):2059–2073.
  • Jia Y, Huang H, Gagnier JJ. A systematic review of measurement properties of patient-reported outcome measures for use in patients with foot or ankle diseases. Qual Life Res. 2017;26(8):1969–2010.
  • Morley D, Jenkinson C, Doll H, et al. The Manchester–Oxford Foot Questionnaire (MOXFQ) development and validation of a summary index score. Bone Jt Res. 2013;2(4):66–69.
  • Eechaute C, Vaes P, Van Aerschot L, et al. The clinimetric qualities of patient-assessed instruments for measuring chronic ankle instability: a systematic review. BMC Musculoskelet Disord. 2007;8:6.