
A Couple of the Nasties Lurking in Evidence‐Based Medicine

Pages 333-352 | Published online: 16 Dec 2008

Abstract

The Evidence‐Based Medicine (EBM) movement is an ideological force in health research and health policy which asks for allegiance to two types of methodological doctrine. The first is the highly quotable motherhood statement: for example, that we should make conscientious, explicit and judicious use of current best evidence (paraphrasing Sackett). The second type of doctrine, vastly more specific and in practice more important, is the detailed methodology of design and analysis of experiments. This type of detailed methodological doctrine tends to be simplified by commentators but followed to the letter by practitioners.

A number of interestingly dumb claims have become entrenched in prominent versions of these more specific methodological doctrines. I look at just a couple of example claims, namely:

  • Any randomised controlled trial (RCT) gives us better evidence than any other study.

  • Confidence intervals are always useful summaries of at least part of the evidence an experiment gives us about a hypothesis.

To offer a positive doctrine which might move us past the current conflict of micro‐theories of evidence, I propose a mild methodological pluralism: in any local context in which none of a variety of scientific methodologies is clearly and uncontentiously right, researchers should not be discouraged from using any methodology for which they can provide a good argument.

Acknowledgements

I would like to thank Adam La Caze, Joan Leach, Fiona Mackenzie, Alison Moore, Dick Parker, an anonymous reviewer for Social Epistemology, and the participants in the University of Queensland’s Fourth Biohumanities Workshop, for helpful comments on this paper. The University of Queensland’s Fourth Biohumanities Workshop was generously supported by the University of Queensland’s Biohumanities Program and the University of Sydney’s Centre for Time.

Notes

[1] One of the few major differences between the various evidence hierarchies is that some of them have split what they call level 1 evidence into sub‐levels, one of which corresponds to “meta‐analyses”, i.e., statistical amalgamations of a number of experiments on a single issue. This is a good idea. It happens not to affect any of the topics discussed in this paper.
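For concreteness, here is a minimal sketch of the kind of statistical amalgamation meant here, using standard inverse-variance (fixed-effect) pooling. The effect sizes and standard errors are invented for illustration and come from no study discussed in this paper.

```python
import math

# Hypothetical per-study effect estimates (e.g. log odds ratios) and
# their standard errors; all numbers are invented for illustration.
effects = [0.42, 0.10, 0.35]
std_errors = [0.20, 0.15, 0.30]

# Fixed-effect meta-analysis: weight each study by the inverse of its
# variance, then pool the estimates.
weights = [1.0 / se**2 for se in std_errors]
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

print(f"pooled effect = {pooled:.3f}, SE = {pooled_se:.3f}")
```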

[2] To find my RCT and non‐RCT, I have gone back in time a few years. As EBM has taken hold, it has become increasingly hard to find non‐RCTs in the peer‐reviewed medical literature. (This in itself tells a story about the success of EBM, of course.) There are still some, and there is some hope that the tide is currently turning (thanks to Fiona Mackenzie for reminding me of this); but in any case, by going way back, I have been able to more or less match my RCT and my non‐RCT in subject matter, which aids the comparison.

[3] I’m taking a few things about how school anti‐drug campaigns work for granted here; for example, that school students talk to each other.

[4] Sometimes a large number of small RCTs can be combined in a statistical meta‐analysis to produce worthwhile results. That does not affect the conclusion of this section, which is merely that some RCTs are bad. Idiosyncratic RCTs, or RCTs which have been conducted sufficiently badly, will never become part of a meta‐analysis.

[5] I say “I bet” because I cannot actually test this without access to the raw data.

[6] In any case, as far as I can tell from their paper, the only variables which Meese et al. tested were age and CD4 lymphocyte count, so no amount of statistical analysis could tell whether the intervention and control groups were matched on such important variables as native language. This shows the importance of measuring the right things, something which tends not to be mentioned in the hierarchies of evidence.

[7] As I argued above for Meese et al., MacGowan et al.’s sample size may effectively have been the number of social units, not the number of people, for some purposes. In that case, their sample size was only seven. But even then, they did much better than the bad RCT, because unlike the RCT they were not relying on their sample size to balance variables between the intervention and control groups. If an RCT’s sample size is too small because of a confusion between individual and social levels of analysis, all of its results are suspect, because the analysis of an RCT depends on variables being balanced in this way. But if MacGowan et al.’s sample size was too small for the same reason, no such assumption was made; all that can have happened is that they confused the effectiveness of methadone clinics in general (the individual level of analysis, assuming a large sample size) with the correct conclusion, namely that the particular methadone clinics they tested may have been idiosyncratically good (the social level of analysis, assuming a small sample size of seven clinics, each with a large sample size within it). What was a fatal mistake in an RCT is a relatively minor mistake here.
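The arithmetic behind treating the number of social units as the effective sample size is the standard design-effect calculation for clustered designs. The sketch below is mine, not from the paper; the numbers are invented apart from the seven clinics mentioned in the note.

```python
# Standard cluster-design arithmetic (not from the paper): when outcomes
# are correlated within social units, the effective sample size shrinks
# by the design effect DEFF = 1 + (m - 1) * ICC, where m is the cluster
# size and ICC is the intra-cluster correlation.
n_clusters = 7        # the seven clinics mentioned in the note
cluster_size = 100    # hypothetical number of people per clinic
icc = 1.0             # extreme case: perfect within-clinic correlation

deff = 1 + (cluster_size - 1) * icc
n_effective = (n_clusters * cluster_size) / deff
print(f"effective sample size = {n_effective:.0f}")  # prints 7 when ICC = 1
```

In the extreme case where outcomes within a clinic are perfectly correlated, the 700 individual observations carry no more information than 7, which is the sense in which the effective sample size "may have been" the number of clinics.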

[8] The OCEBM hierarchy is hardly more recent than the US Preventive Services Task Force hierarchy, but the latter is a very late (although otherwise typical) member of an earlier generation of hierarchies, as the figure demonstrates. I did not use the OCEBM hierarchy in my analyses above because it is much more complicated and uses non‐standard labelling of its categories. We will see that it tells essentially the same story in any case, except for one important improvement: it does take sample size into account, at least sometimes, by requiring narrow confidence intervals in some cases. See below for a discussion of confidence intervals.
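The connection between narrow confidence intervals and sample size can be made explicit: under the usual normal approximation, the half-width of a 95% confidence interval shrinks like 1/√n, so requiring narrow intervals is an indirect way of requiring large samples. A minimal sketch with invented numbers:

```python
import math

# Half-width of a 95% confidence interval for a mean under the normal
# approximation: 1.96 * sigma / sqrt(n). Numbers invented for illustration.
sigma = 1.0  # hypothetical outcome standard deviation
for n in (10, 100, 1000):
    half_width = 1.96 * sigma / math.sqrt(n)
    print(f"n = {n:4d}: 95% CI half-width = {half_width:.3f}")
```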

[9] “Systematic reviews” are statistical agglomerations of smaller studies: what I earlier called meta‐analyses. There has always been a tendency in England to avoid the word “meta‐analysis”, which some English people find pretentious. For a while, the English tried to popularise the alternative term “overview”; that failed, so now they are trying “systematic review”.

[10] There is a matching category of economic analysis called “Absolute better‐value or worse‐value analyses” which would apply when, for example, an intervention made something cost more without incurring any benefit.

[11] The OCEBM hierarchy is promoted by a web site which is focused on clinical health care, so the fact that the hierarchy is also to some extent focused on clinical health care is by no means a criticism in itself.

[12] Incidentally, I do not believe that even the narrow claim is entirely correct, although clearly it is vastly more defensible than dumb claim 1 (so much so that I would not call the narrow claim dumb).

[13] In criticising confidence intervals, I do not wish to imply that researchers should return to previous methods of significance testing. Most, and possibly all, of the problems with confidence intervals apply mutatis mutandis to standard hypothesis tests, statistical significance, p‐values etc., since (e.g.) p‐values are strictly intertranslatable with (isomorphic to) the end‐points of confidence intervals, with only a few exceptions. When these exceptions occur, there is great (and interesting) uncertainty in the medical community about which is right, the p‐value or the confidence interval. See Grossman et al. (Citation1994) for an example.
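The intertranslatability claim can be made concrete. Under the usual normal approximation, a hypothesised value sits exactly at an endpoint of the 95% confidence interval just when its two-sided p-value is 0.05. A minimal check, with invented summary numbers and assuming scipy is available:

```python
from scipy import stats

# Hypothetical summary data: an estimated effect and its standard error.
estimate, se = 0.50, 0.20

# 95% confidence interval under the normal approximation.
z = stats.norm.ppf(0.975)
lower, upper = estimate - z * se, estimate + z * se

# Two-sided p-value for a null hypothesis placed exactly at the upper
# endpoint of that interval: it comes out as 0.05, illustrating the
# intertranslatability between interval endpoints and p-values.
p = 2 * stats.norm.sf(abs(estimate - upper) / se)
print(f"95% CI = ({lower:.3f}, {upper:.3f}); p at endpoint = {p:.3f}")
```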

[14] There are statistical procedures which can assign probabilities to hypotheses, although they have their own drawbacks. The best known are Bayesian procedures. These days Bayesian methods are usually criticised for being excessively subjectivist, but non‐subjectivist Bayesian methods exist and, in fact, were one of the main foils against which confidence intervals were originally intended to compete (Neyman Citation1937).
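To illustrate what assigning a probability to a hypothesis looks like in practice, here is a minimal Bayesian sketch; the prior and data are invented, and nothing here is specific to the non-subjectivist methods mentioned in the note.

```python
from scipy import stats

# Invented trial data: 14 successes among 20 patients.
successes, n = 14, 20

# With a uniform Beta(1, 1) prior on the success rate, the posterior is
# Beta(1 + successes, 1 + failures) by conjugacy.
posterior = stats.beta(1 + successes, 1 + (n - successes))

# Posterior probability of the hypothesis "the success rate exceeds 0.5":
# a direct probability statement about a hypothesis, which confidence
# intervals cannot supply.
print(f"P(rate > 0.5 | data) = {posterior.sf(0.5):.3f}")
```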

[15] This example has been discussed to death in the mathematical literature (Welch Citation1939; Robinson Citation1975, Citation1977, Citation1979; Berger and Wolpert Citation1988), so I can be sure that there are no lurking infinities or other gotchas.

[16] I say “a 75% confidence interval” rather than “the 75% confidence interval” to avoid wasting space by discussing the fact that confidence intervals are not unique. Even though they are not unique, the reader can trust me that there is no other confidence interval which makes more sense of this example. Note by the way that there is no 95% confidence interval in this case, unless we define one which is even more artificial and problematic than the 75% confidence interval.
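Assuming the example in the main text is the familiar two-point example from the literature cited in note 15 (each observation equals θ−1 or θ+1 with probability ½) — a reconstruction on my part, since the main text is not reproduced here — a short simulation shows where the 75% figure comes from:

```python
import random

# Assumed reconstruction of the classic two-point example (see the
# citations in note 15): each observation equals theta - 1 or theta + 1
# with probability 1/2. The procedure below covers theta in exactly 75%
# of repeated samples.
theta = 0.0  # the true value, unknown to the procedure
random.seed(1)
hits, trials = 0, 100_000

for _ in range(trials):
    x1 = theta + random.choice((-1, 1))
    x2 = theta + random.choice((-1, 1))
    if x1 != x2:
        guess = (x1 + x2) / 2  # observations differ: theta is pinned down exactly
    else:
        guess = x1 - 1         # observations agree: right only half the time
    hits += guess == theta

print(f"coverage = {hits / trials:.3f}")  # close to 0.75
```

The procedure is certainly right whenever the two observations differ and right only half the time when they agree, which is exactly the conditional behaviour that makes the unconditional 75% figure misleading as a summary of the evidence.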

[17] I do not claim that it is the fault of the progenitors of EBM that it has ended up covering such an immense field. To some extent, the colonies of EBM may have sucked the colonists onto themselves. This interesting and important question is not my current topic.

[18] Personal communication, Mahesh K. B. Parmar, Cancer Division, Medical Research Council Clinical Trials Unit, 13 July 2007; Donald A. Berry, M. D. Anderson Cancer Center, University of Texas, 15 July 2007.
