Sequential Analysis: Design Methods and Applications
Volume 27, Issue 1, 2008
Original Articles

Discussion on “Second-Guessing Clinical Trial Designs” by Jonathan J. Shuster and Myron N. Chang

Pages 41-45 | Received 27 Oct 2006, Accepted 10 Mar 2007, Published online: 04 Feb 2008

Abstract

While one may impose the group sequential design of one's choice on a completed trial in order to “second-guess” whether the trial could have stopped early for either efficacy or futility, such calculations do not consider the mind-set of either those conducting the trial or the data and safety monitoring committee that reviewed the accumulating data. It is thus easy to come to questionable conclusions because of incomplete information. The optimal four-stage designs are interesting, but they would be more useful if unequal look times were incorporated and software were made available.

1. INTRODUCTION

The Shuster and Chang paper has two purposes. The first is to provide a method of “second-guessing” whether trials reported in the literature could have stopped early had they been designed in a group sequential manner. This allows those reading journal reports of clinical trials to assess whether a reported trial could have stopped earlier than it did, either for efficacy or for futility. The method uses the properties of Brownian motion and superimposes group sequential designs on the completed trial. The second purpose is to provide four-stage group sequential designs that are optimal under a loss function that is a linear combination of the Type I and Type II errors and the average of the expected sample sizes under the null and alternative hypotheses. This is a variant of designs provided earlier by the second author, who included only the expected sample size under the alternative hypothesis in his loss function. My comments will primarily address second-guessing, and I will comment briefly on the optimal four-stage designs.
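
To make the second-guessing idea concrete, here is a small sketch (mine, not the authors'): interim test statistics are simulated as a Brownian bridge pinned at the observed final z-statistic, and a superimposed group sequential boundary is checked at the interim looks. Conditionally on the final value, the interim path does not depend on the unknown treatment effect, which is what makes such an after-the-fact exercise possible. The function name, the boundary values, and the Monte Carlo approach are illustrative assumptions made under the Brownian motion approximation; they do not reproduce Shuster and Chang's exact calculation.

    import numpy as np

    def second_guess(z_final, look_fracs, efficacy_z, futility_z, n_sims=100_000, seed=0):
        """Estimate how likely a superimposed group sequential boundary would have
        stopped a completed trial at an interim look, given only its final z-statistic.
        Interim statistics are drawn from a Brownian bridge pinned at B(1) = z_final."""
        rng = np.random.default_rng(seed)
        t = np.asarray(look_fracs, float)              # interim information fractions, e.g. [0.25, 0.5, 0.75]
        times = np.concatenate((t, [1.0]))
        incr = np.diff(np.concatenate(([0.0], times)))
        # standard Brownian motion W at the look times and at time 1, for each simulated path
        w = np.cumsum(rng.normal(0.0, np.sqrt(incr), size=(n_sims, times.size)), axis=1)
        w1 = w[:, -1:]
        bridge = w[:, :-1] - t * w1 + t * z_final      # B(t_i) conditional on B(1) = z_final
        z_interim = bridge / np.sqrt(t)                # interim z-statistics
        hit_eff = (z_interim >= np.asarray(efficacy_z)).any(axis=1)
        hit_fut = (z_interim <= np.asarray(futility_z)).any(axis=1)
        return hit_eff.mean(), hit_fut.mean()

    # Example: boundaries imposed after the fact on a trial whose final z was 0.3
    # (a null-looking result); the numerical boundary values are illustrative only.
    p_eff, p_fut = second_guess(z_final=0.3,
                                look_fracs=[0.25, 0.5, 0.75],
                                efficacy_z=[4.05, 2.86, 2.34],   # O'Brien-Fleming-like values
                                futility_z=[-0.5, 0.0, 0.5])     # ad hoc futility values
    print(f"P(early efficacy stop) ~ {p_eff:.3f}, P(early futility stop) ~ {p_fut:.3f}")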

2. SECOND-GUESSING

Large clinical trials sponsored by government or industry are commonly overseen by a data and safety monitoring committee (DSMC), which often looks at interim data semiannually. This committee's considerations are summarized in minutes, but the minutes cannot contain any details that might unblind accumulating treatment results. Thus, without documentation in the literature or the cooperation of the trialists, the data set as analyzed at interim analyses will not be available, even if the final data set itself is released. This is because the interim data may not have been altogether up to date at the time of the interim analysis. It is commonplace that the trial “results paper” mentions in a sentence or two that a certain number of interim analyses took place according to a specified stopping rule, and that the trial was monitored by a DSMC. It is usually stated that either the trial continued to its planned conclusion or was stopped early. With the limited number of words that journals permit, the authors stress other trial methods and results, and generally do not reveal further detail of interim analyses. Perhaps making these details available with the final data set would be helpful to those wishing to second-guess, but as Wittes (1993) said, the considerations that led to stopping or not stopping a trial occur behind closed doors. They are not necessarily revealed by the data alone.

The philosophy of DSMCs differs from one trial to another. Some DSMCs may be anxious to stop a trial for futility (perhaps in the pharmaceutical industry); others may believe that current practice would not change were such early stopping to occur. Wittes (1993) emphasizes that the proceedings of a DSMC are confidential and often never revealed. A calculation such as the one proposed here will not consider nonstatistical aspects of trial monitoring, such as investigator or DSMC philosophy.

Two examples will illustrate the point. The Digitalis Investigation Group (1997) reported on a trial, known as DIG, that randomized heart failure patients to Digoxin or a placebo, in addition to the usual care these patients received. DIG was designed as a large, simple trial with overall survival as its primary outcome. The drug Digoxin had been in use for many years, so the toxicity patterns were well known and there was little evidence that the drug was harmful. The question of whether the drug improved survival in heart failure patients remained unanswered. The trial was monitored by a DSMC that met every six months. Accrual occurred between February 1991 and August 1993, and 6,800 patients were entered and followed up for an average of 37 months.

The trial was not stopped early, and the outcome was that there was no survival advantage for those given Digoxin. Accumulating interim data were not published, but the published time-to-death distributions were practically overlapping. It is clear that if a second-guess design were imposed on the trial, one would conclude that the trial could have stopped early for futility.

The second-guess analysis would not consider that there were secondary outcomes of great interest in DIG. Mortality due to worsening heart failure was of borderline significance, with those on Digoxin faring better than those on placebo, with (uncorrected) p-value 0.06 at the end of the trial. In addition, the incidence of death or hospitalization for heart failure was significantly lower in the Digoxin arm, with (uncorrected) p-value 0.001. Had the trial stopped early, as second-guessing suggests it could have, the impact of the negative primary result would have been diminished, and readers would have been left not knowing what might have happened had the trial continued. They would also have wondered about the effect of Digoxin on mortality from worsening heart failure. Even if the death-or-hospitalization result had looked promising, its implications for overall survival would have been unclear. Continuing the trial did not harm the patients and allowed definitive results on a commonly used drug to emerge.

The nature of the stopping rule of the DIG trial (not published) made it unlikely that the trial would have stopped early for efficacy. The DIG investigators were not interested in early stopping unless there was overwhelming evidence of a survival advantage or disadvantage, and so a Peto-type rule was used (see, e.g., Geller and Pocock, 1987). When an interim analysis occurred, the critical p-value was 0.001 until the last analysis, when a critical p-value close to 0.05 would be used. It is not surprising that this stopping rule was not reported, because the interim analyses had little effect on the overall p-value required for statistical significance.
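
For readers unfamiliar with such rules, the fragment below sketches a Haybittle–Peto-type check of the kind just described: stop early only if an interim p-value falls below a very stringent threshold, reserving essentially the nominal level for the final analysis. The thresholds follow the description above; the p-values in the example are invented for illustration and are not DIG data.

    def peto_type_stop(interim_p_values, final_p_value=None,
                       interim_alpha=0.001, final_alpha=0.05):
        """Haybittle-Peto-type rule: stop early only on an extreme interim p-value;
        otherwise test at close to the nominal level at the scheduled end."""
        for k, p in enumerate(interim_p_values, start=1):
            if p < interim_alpha:
                return f"stop at interim look {k} (p = {p:.4g} < {interim_alpha})"
        if final_p_value is not None and final_p_value < final_alpha:
            return f"significant at the final analysis (p = {final_p_value:.4g} < {final_alpha})"
        return "no early stopping; run to the final analysis"

    # A DIG-like pattern: interim differences never extreme, so the trial runs to completion.
    print(peto_type_stop([0.40, 0.62, 0.55, 0.70, 0.81], final_p_value=0.80))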

Now let us consider the fictional scenario in which the DIG trial continued to its scheduled completion but had a positive result. It might have occurred that the survival curves separated from the start, yet the difference in survival distributions never reached sufficient significance to stop the trial early (i.e., interim p-values always greater than 0.001). A second-guesser might conclude that the trial continued far longer than necessary, because the stopping rule chosen for second-guessing would make it appear that early stopping should have occurred.

The relevance of the fictional DIG scenario is that the investigators of the Second International Study of Infarct Survival (ISIS-2) might well have had a view of their trial similar to the one the DIG investigators had of theirs: early stopping would not change practice unless the results demonstrated overwhelming statistical significance. Thus, second-guessing cannot anticipate the mind-set of either the investigators conducting the trial or the DSMC reviewing the accumulating data, and therefore may reach conclusions based on inadequate information.

A second, more recent, example is the hormone replacement therapy (HRT) trial of estrogen plus progestin (E + P) in the Women's Health Initiative. A group of 16,608 healthy postmenopausal women with intact uteri was randomized in equal proportions to receive either a placebo or E + P, with follow-up planned for 8.5 years. The primary hypothesis under test when the E + P trial began was that HRT would decrease coronary heart disease (CHD). The trial was stopped 3.2 years early because of an excess of invasive breast cancers and overall evidence that risk exceeded benefit. For details, see Women's Health Initiative Study Group (1998) and Writing Group for the Women's Health Initiative Investigators (2002).

Five years of accrual began in 1993, and the trial ended in April 2002. The trial was overseen by an independent DSMC. Beginning in the fall of 1997, and at approximately six-month intervals thereafter, the DSMC reviewed outcome data, including the primary outcome (CHD), a primary adverse outcome (invasive breast cancer), and several secondary outcomes. Some of the interim data are included in the report. The primary outcome had a hazard ratio of 1.29 when the trial was stopped; that is, HRT caused an increase in CHD rather than a decrease. The report also notes that the negative effect of E + P was most evident at one year.

If we were to undertake the second-guess analysis proposed by Shuster and Chang, we would likely find that this trial could have stopped for futility very early on. That calculation would not consider that, at the time the Women's Health Initiative began, hormone replacement therapy was widely recommended for postmenopausal women and was promoted not only for preventing hot flashes but for overall cardioprotection, as described in the 1998 paper. The DSMC rigorously followed stopping rules for efficacy and adverse events, and it recommended stopping the trial when the boundary for invasive breast cancer was crossed. It is not clear that early stopping for futility would have had sufficient credibility in the medical community, nor would the definitive result on invasive breast cancer have been known. The impact of waiting until the results were “mature” was that the FDA issued a warning for all estrogen hormone replacement therapy; see Stapleton (2003). As a result, the use of hormone replacement therapy in postmenopausal women decreased greatly. Had the trial stopped early for futility, the impact would not have been nearly as great.

These two examples illustrate some of the difficulties with second-guessing. While hindsight is always better than foresight, clinical trials rarely continue gratuitously. The calculations proposed by Shuster and Chang may be amusing to statistical readers of the medical literature, but those who employ them should be aware that the calculations will not lend insight into the reasons a trial continued or stopped as it did.

3. FOUR-STAGE REFERENCE DESIGNS

Group sequential designs with optimal properties have been a subject of interest for many years, and Shuster and Chang contribute here by providing reference designs with four equally spaced stages that minimize a cost function: a linear combination of the Type I error, α, the Type II error, β, and the average of the expected sample sizes under the null and alternative hypotheses. The designs are found by a computer algorithm that could be helpful if made available. While the designs might be useful for second-guessing and for raising awareness that one should be as efficient as possible when designing a trial, it is difficult to see how the method could be used for unequally spaced looks, which are so common in practice. As the authors point out, more work on optimality criteria for group sequential designs should be encouraged.
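
Schematically, the cost being minimized is of the form q1·α + q2·β + q3·[E_H0(N) + E_H1(N)]/2 for some nonnegative weights (the paper's exact weights and optimal designs are not reproduced here). The sketch below does not reproduce the authors' optimization algorithm; it only illustrates, under the Brownian motion approximation with four equally spaced looks, how the cost of a candidate four-stage design could be evaluated by simulation. The weights, boundaries, drift, and function name are illustrative assumptions.

    import numpy as np

    def four_stage_cost(eff_z, fut_z, n_max, drift_alt,
                        weights=(1.0, 1.0, 0.001), n_sims=200_000, seed=1):
        """Monte Carlo evaluation of a linear-combination cost for a four-stage design:
        q1*alpha + q2*beta + q3*(average of E(N) under H0 and H1).  The drift_alt
        argument is the expected final z-statistic under the alternative."""
        rng = np.random.default_rng(seed)
        t = np.array([0.25, 0.5, 0.75, 1.0])           # equally spaced information fractions
        incr = np.diff(np.concatenate(([0.0], t)))

        def operate(drift):
            # score process B(t) = drift*t + W(t), observed at the four looks
            w = np.cumsum(rng.normal(drift * incr, np.sqrt(incr), size=(n_sims, 4)), axis=1)
            z = w / np.sqrt(t)
            reject = np.zeros(n_sims, bool)
            stopped = np.zeros(n_sims, bool)
            n_used = np.full(n_sims, float(n_max))
            for k in range(4):
                hit_eff = ~stopped & (z[:, k] >= eff_z[k])
                hit_fut = ~stopped & (z[:, k] <= fut_z[k])
                reject |= hit_eff
                newly = hit_eff | hit_fut
                n_used[newly] = n_max * t[k]
                stopped |= newly
            return reject.mean(), n_used.mean()

        alpha, en0 = operate(0.0)                      # Type I error and E(N) under the null
        power, en1 = operate(drift_alt)                # power and E(N) under the alternative
        q1, q2, q3 = weights
        return q1 * alpha + q2 * (1.0 - power) + q3 * 0.5 * (en0 + en1)

    # Example with illustrative boundaries (not the paper's optimal designs):
    cost = four_stage_cost(eff_z=[4.05, 2.86, 2.34, 2.02],
                           fut_z=[-0.5, 0.3, 1.0, 2.02],
                           n_max=400, drift_alt=2.8)
    print(f"estimated cost: {cost:.3f}")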

4. CONCLUSIONS

I thank the authors for a provocative and interesting paper. One should be careful about the conclusions drawn from a “second-guess” analysis of a completed trial, because the data themselves may not be the only consideration that led to continuing a trial or stopping it early.

Notes

Recommended by N. Mukhopadhyay

REFERENCES

  • Digitalis Investigation Group (1997). The Effect of Digoxin on Mortality and Morbidity in Patients with Heart Failure, New England Journal of Medicine 336: 525–533.
  • Geller, N. L. and Pocock, S. J. (1987). Interim Analysis in Randomized Clinical Trials: Ramifications and Guidelines for Practitioners, Biometrics 43: 213–223.
  • Stapleton, J. (2003). FDA Orders Estrogen Safety Warnings, Journal of the American Medical Association 289: 537–538.
  • Wittes, J. (1993). Behind Closed Doors: The Data Monitoring Board in Randomized Clinical Trials, Statistics in Medicine 12: 419–424.
  • Women's Health Initiative Study Group (1998). Design of the Women's Health Initiative Clinical Trial and Observational Study, Controlled Clinical Trials 19: 61–109.
  • Writing Group for the Women's Health Initiative Investigators (2002). Risks and Benefits of Estrogen Plus Progestin in Healthy Postmenopausal Women, Journal of the American Medical Association 288: 321–333.
