Research Article

The misinterpretation of the standard error of measurement in medical education: A primer on the problems, pitfalls and peculiarities of the three different standard errors of measurement

Pages 569-576 | Published online: 18 Apr 2012
 

Abstract

Background: In high-stakes assessments in medical education, such as final undergraduate examinations and postgraduate assessments, an attempt is frequently made to set confidence limits on the probable true score of a candidate. Typically, this is carried out using what is referred to as the standard error of measurement (SEM). However, it is often the case that the wrong formula is applied, there actually being three different formulae for use in different situations.

Aims: To explain and clarify the calculation of the SEM, and differentiate three separate standard errors, which here are called the standard error of measurement (SEmeas), the standard error of estimation (SEest) and the standard error of prediction (SEpred).

Results: Most accounts describe the calculation of SEmeas. For most purposes, though, what is required is the standard error of estimation (SEest), which has to be applied not to a candidate's actual score but to their estimated true score after taking into account the regression to the mean that occurs due to the unreliability of an assessment. A third formula, the standard error of prediction (SEpred) is less commonly used in medical education, but is useful in situations such as counselling, where one needs to predict a future actual score on an examination from a previous actual score on the same examination.
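The three standard errors and the estimated true score can be written down compactly. The following is a minimal sketch using the standard classical test theory formulas (SEmeas = SD√(1 − r), SEest = SD√(r(1 − r)), SEpred = SD√(1 − r²), and estimated true score = mean + r × (actual − mean)); the numerical values are purely illustrative.

```python
import math

def se_measurement(sd, r):
    """SEmeas: spread of actual scores around a fixed true score."""
    return sd * math.sqrt(1 - r)

def se_estimation(sd, r):
    """SEest: spread of true scores around the estimated true score."""
    return sd * math.sqrt(r * (1 - r))

def se_prediction(sd, r):
    """SEpred: spread of a future actual score predicted from a past actual score."""
    return sd * math.sqrt(1 - r ** 2)

def estimated_true_score(actual, mean, r):
    """Regression to the mean: shrink the actual score towards the group mean."""
    return mean + r * (actual - mean)

# Illustrative values: group mean 60, SD 10, reliability 0.8, actual mark 45
sd, r, mean = 10.0, 0.8, 60.0
print(estimated_true_score(45.0, mean, r))    # 48.0 (not 45: regression to the mean)
print(round(se_measurement(sd, r), 2))        # 4.47
print(round(se_estimation(sd, r), 2))         # 4.0
print(round(se_prediction(sd, r), 2))         # 6.0
```

Note that a confidence interval built from SEest is centred on the estimated true score (48.0 here), not on the actual mark of 45, and that the three standard errors already differ even at a respectable reliability of 0.8.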

Conclusions: The various formulae can produce predictions that differ quite substantially, particularly when reliability is not particularly high, and the mark in question is far removed from the average performance of candidates. That can have important, unintended consequences, particularly in a medico-legal context.

Acknowledgements

I am very grateful to Richard Wakeford for his careful reading of an earlier version of this manuscript, and for his expert knowledge of a number of examinations.

Declaration of interest: The author reports no conflicts of interest. The author alone is responsible for the content and writing of this article.

Notes


1. It is perhaps also worth mentioning in passing that the abbreviation SEM has at least three very different meanings in statistics: Standard Error of Measurement, as used here; Standard Error of the Mean (which is a different concept entirely); and Structural Equation Modelling (which is an advanced form of path modelling related to multiple regression and factor analysis).

2. In the PMETB recommendations, the formula is actually written as ‘SEM = Standard Deviation × √1 − α’. The lack of brackets makes it a little ambiguous (and to a pedant, perhaps even wrong), and while Cronbach's alpha is indeed a widely accepted measure of reliability, it is not the only one.
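The missing brackets are not merely pedantry: the two readings the typography permits give very different numbers. A quick check, with an illustrative SD of 10 and alpha of 0.8:

```python
import math

sd, alpha = 10.0, 0.8

# Intended reading: SEM = SD * sqrt(1 - alpha)
intended = sd * math.sqrt(1 - alpha)

# Reading the bare typography literally: SD * sqrt(1), minus alpha
misread = sd * math.sqrt(1) - alpha

print(round(intended, 3))  # 4.472
print(round(misread, 3))   # 9.2
```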

3. If a test had n items, and one obtained a score for the first n/2 items and then a score for the second n/2 items, then, assuming the two parts of the test are equivalent (parallel), the correlation between the two scores is both a split-half reliability and also a test–retest reliability, with it just so happening that the second test follows immediately on from the first.

4. Even in 1936, Guilford put a footnote on page 414, saying, ‘Too often one finds the interpretation of [standard error of measurement] is misstated’.

5. Here, and elsewhere, I use probabilities such as 0.68 when referring to confidence intervals, since otherwise there is a risk of confusion between examination marks (which could be 68%) and confidence ranges. Just as two SDs from a normal distribution includes 0.95 of the values (and hence the conventional test for significance of p < 0.05), so one SD includes 0.68 of the values and that is the conventional range used in educational psychometrics.
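The 0.68 and 0.95 figures follow from the normal distribution: the probability of falling within z standard deviations of the mean is erf(z/√2). A one-line check using only the standard library:

```python
import math

def coverage(z):
    """Probability that a normal value lies within z standard deviations of the mean."""
    return math.erf(z / math.sqrt(2))

print(round(coverage(1), 2))   # 0.68 -- the conventional psychometric range
print(round(coverage(2), 3))   # 0.954 -- approximately the 0.95 of p < 0.05
```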

6. And in the extreme case, a perfect assessment would have SEmeas = 0, and the candidate would get the same actual score every time.

7. This is a recurrent source of error in basic statistics and relates to the difference between the regression of y on x, the regression of x on y, and the principal axis. When drawing lines through data, people tend to think they are drawing the regression of y on x but are usually representing the principal axis (derived from the principal component). The problem arises because y and x are asymmetric in regression (one is being predicted from the other, and there is the implicit assumption that one measure (the dependent variable, typically y) is measured with error and the other (the independent variable, typically x) is measured without error). Conversely, in principal components analysis, x and y (or more properly, y1 and y2) are equal co-partners in the calculation of the principal axis. The point of this is not to confuse, but to emphasise that the diagram is correct but that there are good reasons why it does not look so.

8. As a thought-experiment, it is worth noting that were the reliability to be 0 then the diagonal solid line in the figure would be vertical, and the best estimate of the true score would always be at the mean, whatever the actual score obtained.

9. In fact, there is no reason why the confidence interval should even include the actual mark, particularly when reliability is low. As a thought experiment, consider a completely unreliable exam with a reliability of exactly 0 (and, say, a mean of 60 and an SD of 10). By bad luck alone some candidates will score 50, but of course the next time around those candidates will be as likely as any others to have a score of 60, so the best estimate of their true score is 60. Furthermore, there is no reason for believing that the estimate should take any other value than 60, so the confidence interval is 60 to 60 (as reflected in SEest being SD × √(r(1 − r)) = 0). Notice also that the PMETB approach would expect that the SEM (i.e. SEmeas) would be SD × √(1 − 0) = 10, and hence that there is a 0.68 probability that those individuals who were merely unlucky the first time around are likely to have a true score in the range 50 − 10 = 40 to 50 + 10 = 60. That is the gambler's fallacy, for remember that this hypothetical examination has a reliability of 0, and hence the numbers it generates are as random as the toss of a coin or the spin of a roulette wheel.
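The thought experiment is easy to simulate. Assuming, as the note's numbers imply, an exam whose scores are pure noise with mean 60 and SD 10 (reliability 0, so each sitting is an independent draw and the candidate contributes nothing), candidates who happened to score about 50 the first time average the group mean on retest:

```python
import random

random.seed(1)
MEAN, SD, N = 60.0, 10.0, 200_000

# Reliability 0: the two sittings are independent draws from the same distribution.
first = [random.gauss(MEAN, SD) for _ in range(N)]
second = [random.gauss(MEAN, SD) for _ in range(N)]

# Candidates who, by bad luck alone, scored about 50 the first time...
retest = [s for f, s in zip(first, second) if 49.5 <= f <= 50.5]

# ...average roughly 60 on retest: their first score carried no information.
print(round(sum(retest) / len(retest), 1))
```

Expecting those candidates' true scores to lie mostly below 60 because their first mark was 50 is exactly the gambler's fallacy the note describes.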

10. There is a rather more complicated problem, alluded to by Hofstee, in which the mark on the previous attempt should be taken into account in assessing the current mark, so that the pass mark should, in effect, be higher when an examination is repeated, lest chance alone eventually mean that the candidate passes. The analogy perhaps is with those board games in which one has to throw a six with a die to start. Eventually, one must get a six, but that is due to chance alone, and not because one has become more skilful at six-throwing.
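The dice analogy can be made concrete. Assuming each resit of a wholly unreliable exam offers an independent chance p of passing, the probability of having passed at least once after k attempts is 1 − (1 − p)^k, which climbs towards certainty regardless of ability:

```python
def chance_of_ever_passing(p, attempts):
    """Probability of at least one pass in `attempts` independent tries."""
    return 1 - (1 - p) ** attempts

# A 1-in-6 chance per attempt (the "throw a six to start" analogy)
for k in (1, 6, 20):
    print(k, round(chance_of_ever_passing(1 / 6, k), 2))  # 0.17, 0.67, 0.97
```

This is why, as Hofstee notes, an unadjusted pass mark becomes progressively more lenient with each permitted resit.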

11. We are trying to predict a future actual mark here because that is what will say whether or not the candidate actually passes the exam. It would not make sense to predict a future true mark because, given the assumption that only sampling error is important in attaining different results on different occasions, any future true mark must be identical to any past true mark. Of course hard work could increase a candidate's true ability, but that is not the situation being considered here.

12. At one point, Streiner and Norman follow Charter and Feldt to argue that SEmeas ‘applies at the level of the individual’ (p. 192), whereas SEest is used ‘for all people with a given obtained score’ (p. 193). Later, though, it is commented that, ‘the SEM for a given individual is dependent on the people with whom he or she is tested; that is the distribution of scores for that particular sample. This makes little sense; how much a person's score will change on retesting because of sampling error, which is what the SEM reflects, is a function of the individual, not the other members of the group’ (p. 301). There are deep philosophical issues here, but they are akin to how much one can infer about the future behaviour of a coin that has been tossed, given that it has come down heads. If it is a standard coin of the realm, identical to all others, it is reasonable to set a prior on that of 0.5, and estimate using that assumption. If the coin is entirely novel then after a single toss the information on its behaviour is minimal, and little more can be inferred. When a candidate takes an exam they cannot be considered independently of all other candidates (and if they could, how would the SEM be calculated anyway?). Ultimately, the discussion reduces to a matter of what Bayesian priors we are willing to accept in the case of a single candidate. If we accept nothing then we also can conclude nothing. That may be philosophically pure, but the extreme nihilism means that we can probably also conclude nothing about any candidate in any exam, which is not very practical for those trying to distinguish the competent from the incompetent.
