Sequential Analysis
Design Methods and Applications
Volume 27, 2008 - Issue 1
Original Articles

Authors' Responses

Pages 50-57 | Received 28 Aug 2006, Accepted 15 Nov 2006, Published online: 04 Feb 2008

Abstract

Rampant second-guessing of the design and/or results of randomized clinical trials could make scientific experts reluctant to participate in the management or oversight of these important vehicles for testing new therapies. Under the premise that second-guessing is a growing issue and is here to stay, we consider it important that it be done on sound scientific footing. Given the controversial nature of the subject, the editor obtained discussions from a group of eight experts representing multiple points of view. These discussions add a great deal to our publication. This article provides our summary response to both the practical and the technical issues underlying their discussion.

1. INTRODUCTION

We are grateful to the discussants, all of whom contributed very useful information on the monitoring of clinical trials and on the concept of second-guessing of clinical trial designs, as presented in our article. As expected, there is diversity of opinion on this issue, ranging from very practical matters to very theoretical ones. We offer our commentary on a variety of the issues raised. In Section 2, we look at the interaction between second-guessing and data and safety monitoring committees. Section 3 considers the argument for fully sequential designs, which might eliminate much of the second-guessing. In Section 4 we examine whether sequential trials need temporary stoppages of accrual to determine whether it is appropriate to enter new subjects onto a trial. The final section is devoted to technical issues, including potential new areas of research, such as second-guessing in adaptive designs, multitreatment designs, and small-sample issues.

2. SECOND-GUESSING AND DATA AND SAFETY MONITORING COMMITTEES

Four of the discussants, Professors Rosenberger, O'Brien, and Ellenberg and Dr. Geller, discuss second-guessing in the context of the difficult real-world decisions that data and safety monitoring committees (DSMCs) must face. We agree with most of the thoughtful and constructive comments made by these discussants, who have added considerable insight into the practical aspects of conducting randomized clinical trials. The major themes are that (1) stopping rules should be treated as guidelines, subject to medical evaluation over and above any statistical evaluation, and (2) the deliberations of DSMCs are not often made public, perhaps due to lack of journal space, and for that reason second-guessing based on incomplete information may seem unfair. Professor Rosenberger went as far as to suggest that DSMC members might face lawsuits. However, as cited in our paper, second-guessing is a growing issue in clinical trials and is here to stay, so it is best done on the basis of sound science rather than on other factors. As statisticians, we are accustomed to making inferences from incomplete information, and the methods we proposed use the published information to do just that. While the first author was the group statistician of the Pediatric Oncology Group (POG), the father of a six-year-old who died on the inferior treatment of a POG trial asked him the legitimate question, “Did you know the treatment was inferior at the time when my child was randomized?” With the Internet and trial registries, second-guessing is indeed reaching the patient/family level. We suspect that our tools will usually be supportive of actual designs rather than critical of them.

All four of these discussants point out that even though a stopping barrier has been crossed, the study is not automatically terminated. Reasons cited include the evaluation of important secondary endpoints such as toxicity, the need for knowledge about other aspects of the treatment or disease, and suspicions about the validity of the model that assessed the efficacy. In our own discussion (Section 6 of our paper), we strongly recommended that the publication of the trial disclose the interim monitoring plan and the efficacy results for all interim analyses. If an efficacy (futility) barrier is crossed, then except when there is a rationale of potential model invalidity (e.g., nonproportional hazards in survival analysis), the study conclusion about efficacy should be based on this interim conclusion and not upon the final data. To do otherwise would adversely affect the study's operating characteristics (power and Type I error). The paper's discussion should also include the rationale for continuing the study. As cited by these discussants, there may be valid reasons to continue a trial, as long as the DSMC remains in equipoise about the impact on present and future trial participants and judges that withholding the results from the general public will not be harmful to public health at large.

These same four discussants all point out good examples where the DSMC must wrestle with difficult decisions after a barrier is crossed. But first, it is critically important that the committee restrict itself to the two functions in its title: (1) are high-quality data being accrued in a timely way (monitoring data quality and accrual rates), and (2) is the safety of patients (present and future) being properly protected? To protect themselves from potential conflicts of interest, there must be a firewall that prevents the members of the DSMC from crossing the line and becoming de facto study investigators. The DSMC is advisory to the investigators, sponsors, and institutional review boards. All of these discussants suggest that studies whose primary efficacy measure crossed the stopping barrier might be allowed to continue because of important secondary endpoints. In principle, we agree with this. However, there is a legitimate second-guess question as to why a univariate stopping rule was employed when a multivariate stopping rule could have obviated this problem. Unfortunately, under the present state of knowledge, multivariate monitoring methodology to resolve this issue is lacking. The recent work of Mor and Anderson (2005) is a step in the right direction, but it needs further development before it can be widely adopted. A similar issue applies to the second example of Professor O'Brien, regarding subsets. For example, if an interim analysis crosses the efficacy barrier overall, is it ethical to continue the study as is because definitive results have not been seen in one or both genders? After all, both males and females are part of the overall target population. Continuation of the study on this basis alone could clearly be inferred as harmful, on average, to future participants assigned to the inferior treatment. New sequential methods for jointly monitoring overall and subset efficacy might resolve this issue.
However, if the overall monitoring rule says that efficacy has been established overall, continuation should be allowable only if there is a sequentially demonstrated subset-by-treatment interaction. Under such circumstances, a partial closure should still be undertaken. Before recommending even a partial continuation, the DSMC also needs to ask whether a gender-specific efficacy analysis was a major objective of the study.

Professor Ellenberg pointed out a real example, described in Wheatley and Clayton (2003), where a trial using survival analysis had an interim analysis that apparently crossed an efficacy barrier but was not stopped. The rationale was that the effect size seemed unrealistically high at that point. The study was kept open to randomization and looked at 6 months later, rather than the planned 12 months later. The result was again similarly significant, yet randomization was kept open, with another look 6 months later. Over time, the treatment effect became nonsignificant. Because of this double jeopardy, the study had lower Type I error and lower power than advertised. It is unclear from the paper whether the study had a true group sequential analysis plan covering the study-wise overall Type I and Type II errors, or whether the interim analyses actually conducted were part of a prospective study design plan. In either case, one option that could have been employed to protect subjects from potential harm would have been to suspend accrual but obtain longer follow-up on the patients already enrolled in the trial. As noted in Wheatley and Clayton (2003), the same conclusion would have been reached by this strategy, and the randomization could have been restarted. A nice feature of Brownian motion under proportional hazards is that the sequential properties are unaffected by gaps in new patient accrual. If this was indeed an unplanned interim analysis, suspension of accrual to clear up the equipoise issue is perfectly legitimate. The DSMC has a higher priority to protect patient welfare than to finish a trial on time.
It needs to be noted that Professor Ellenberg presents this as an example where second-guessing might be unfair, and yet it appears that the actions of the DSMC ignored the safest option (C) among the standard committee choices: (A) continue unaltered, (B) continue with alterations, (C) suspend accrual, and (D) terminate the trial.

Dr. Geller and Professor Ellenberg both make the point that DSMCs worry about universal acceptance of a trial result when an efficacy barrier has been surpassed in an interim analysis. The example cited in our paper, the International Sudden Infarct Study 2, might be a good case in point. The patient outcome is available just five weeks after randomization, so no model issues are at stake. Had the trial terminated early for crossing an efficacy barrier, as almost certainly would have been suggested by any reasonable group sequential procedure, then indeed the confidence interval for the effect size would have been wider than it was under full accrual. But balanced against that is the fact that, from that point on, 25% of the patients received the inferior double-placebo treatment, and the trial results were withheld from the public from the time of crossing a hypothetical barrier to the time of final publication. As we noted in our article, giving convincingly inferior therapy just to tighten up a confidence interval seems unethical.

In her abstract, Professor Ellenberg states that for trials aimed at relief of symptoms, rather than at a serious health issue, stopping for benefit is not a consideration. We agree that many trials fall under this rubric, but institutional review boards, not the investigators or sponsors, should make that call when the study is initially designed. Ideally, any trial that meets the following two conditions should have a sequential monitoring plan: (1) there are public health consequences of efficacy and/or futility, and (2) the results accrue over an extended period of time in which meaningful interim efficacy information would be available and methodologically analyzable in a timely manner. Unfortunately, there is still a gap in our arsenal of statistical tools to cover all types of clinical trial endpoints, so condition (2) can be easier said than done.

Professor O'Brien, in Example 2.3, suggests that a trial that meets a futility barrier may need to be continued for toxicity comparisons. This would indeed be reasonable in many contexts. But if the trial compares an active agent with placebo, the toxicity question becomes moot. Further, in cooperative trial networks, such as TRIALNET (diabetes), the AIDS Clinical Trials Network, or the National Cancer Institute Cooperative Groups, DSMCs need to recognize the limited opportunities for research that may be hindered by ignoring the crossing of a futility boundary. Are future patients better served by allowing the trial to continue, or by moving up the next generation of trials? Nonetheless, there can be good rationale for having no futility provision in a clinical trial. As the discussants note, some trials have no major safety concerns, and important ancillary translational research questions might be compromised by early termination.

Recommendations: Based on the excellent discussion of this topic by the discussants, we recommend the following to protect investigators and DSMC members from inappropriate second-guessing: (1) minutes of confidential DSMC meetings and recommendations should be maintained and made available to the journal editors at the time of submission of the major outcome manuscript, and (2) no letter or article that second-guesses a trial should appear without a companion discussion from the trial's lead authors.

3. ARGUMENTS FOR FULLY SEQUENTIAL TRIALS

Professor Gombay makes very effective arguments that today many trials can, and probably should, be monitored continuously rather than in a group sequential manner. Methodology for dealing with nuisance parameters has advanced over the last two decades, while at the same time the Internet has sped up the process of data acquisition. It is possible to develop scripts that run daily and alert the project statisticians that an urgent meeting of the DSMC might be needed to discuss continuation of the trial. In arguing for fully sequential analysis, Professor Gombay cites an example where an early difference (due to early deaths on one treatment) was later overturned. This is similar to the example cited by Professor Ellenberg from Wheatley and Clayton (2003). A fully sequential method used in survival analysis needs either to be very conservative early on or to allow temporary stoppages in accrual to determine whether this early-death issue is spurious; otherwise, it is more likely than a group sequential method to reach an early conclusion. Professor Gombay also presumes that fully sequential methods will virtually eliminate the need for second-guessing. Her argument may have some merit, but with an infinite array of fully sequential methods sharing the same operating characteristics, consumer questions as to whether the decision could or should have been made earlier are bound to arise. Finally, although information can travel quickly, results are usually obtained over a finite time interval, which hampers the application of fully sequential methods. The next section looks at this issue in more detail.
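Such a daily alert script can be sketched as follows. This is an illustrative sketch only: the boundary function is a simplified O'Brien-Fleming-type bound with an assumed critical constant, not the monitoring rule of any particular trial, and the function names are our own.

```python
from math import sqrt

def obf_bound(info_frac, c=2.24):
    """Simplified O'Brien-Fleming-type boundary: very conservative early
    (large critical value), relaxing as the information fraction grows.
    The constant c = 2.24 is an assumed, illustrative value."""
    return c / sqrt(info_frac)

def needs_dsmc_alert(z, info_frac):
    """True if today's two-sided test statistic crosses the boundary,
    signaling that an urgent DSMC meeting might be needed."""
    return abs(z) >= obf_bound(info_frac)
```

For example, at half the planned information the bound is about 3.17, so a statistic of 1.8 raises no alert while a statistic of 4.0 does.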

4. TEMPORARY STOPPAGE OF ACCRUAL

Professor O'Brien raised the issue of temporary stoppage of accrual, indicating that it is undesirable and uncommon. We saw in the Wheatley and Clayton (2003) example that this would have been the safest approach in their trial. Moreover, consider a surgical trial where the outcome is binary (alive/dead) through hospitalization and rehabilitation (generally a period of six weeks). If an interim analysis and DSMC meeting are planned after the 600th patient's outcome, then we argue that under either of the two scenarios below, accrual should be halted immediately after the 600th patient is accrued and not reopened until it is established that the study is to continue. Scenario 1: the study is open label, and it is predicted that there is a realistic chance that a stopping barrier will be crossed. Scenario 2: the study is double blind, in which case accrual should be halted irrespective of where the interim analysis is predicted to fall; if instead we halted only when a stop seemed likely, as in Scenario 1, the very act of halting accrual could give the participating physicians a hint that a difference exists. The practical advantage of group sequential methods over fully sequential methods is that the former limit the number of temporary closures. Whatever our stopping rules happen to be, are newly accrued patients adequately protected when the design calls for an as yet uncompleted interim analysis on the patients accrued before them (whether group sequential or fully sequential)?

5. TECHNICAL COMMENTS OF REVIEWERS

Professor O'Brien suggested that these reference designs be compared to other existing designs. For prospectively designed, nonadaptive univariate studies with four equally spaced looks and matched operating characteristics (Type I and Type II errors), our designs are superior to all competitors under this utility (the mean of the expected sample sizes under the null and alternative hypotheses) and, based on grid searches, are nearly optimal under other loss functions. It would be valuable to quantify the potential advantage of adaptive procedures over and above the designs proposed in our paper. He further suggested not worrying about power and Type I error (which we matched to the actual trial), but instead comparing the actual design to ours with the same upper bound on sample size. We consider this a somewhat unfair advantage for the second-guess design, which would be underpowered and therefore have smaller average sample sizes. However, one can obtain the average sample numbers for doing just that by multiplying the E(θ) entries of Table 4 by 0.75. Finally, Professor O'Brien suggests that our methods are Bayesian. It is true that we employed Bayesian methods to narrow the field of competing designs, but the resulting framework is strictly frequentist.

Dr. Geller suggests that the optimal designs be extended to unequal information fractions, and this comment appears in the summary of our article. She also suggested that software be made available. A Pascal program can be obtained upon request from the second author, Professor Myron Chang, at [email protected].

As noted above, Professor Gombay suggests that fully sequential methods are feasible in far more trials than previously possible. They too can be designed with appropriate operating characteristics. If one can indeed accomplish this, especially in trials that have endpoints that are obtained in a short period of time relative to the total maximum planned accrual duration, one should be able to design much more efficient trials than the traditional group sequential trial. Research on optimizing the cut points for fully sequential studies would be a daunting but important research project.

5.1. Responses to Comments by Dr. Biswas

Dr. Biswas suggests that more flexible designs could be used. The first recommended set of designs spends the total Type I error rate α across the k stages according to a Type I error spending function (Lan and DeMets, 1983). The Type I error spending function is determined before the trial begins. The designs proposed by Lan and DeMets are widely used in clinical trials because of their flexibility in the timing of interim analyses. As noted by Dr. Biswas, “In a similar spirit, one can as well spend the Type II error in different stages, in a group sequential manner.” Equipped with both Type I and Type II error spending functions, designs will satisfy the Type I error requirement exactly and the Type II error requirement approximately (Chang et al., 1998).
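As an illustration of how such a spending function allocates error over the course of a trial, the following sketch computes the cumulative Type I error spent at four equally spaced looks under the O'Brien-Fleming-type spending function of Lan and DeMets. Note that converting the increments of spent error into actual group sequential boundaries requires a further recursive numerical integration over the joint distribution of the sequential statistics, which is not shown here.

```python
from statistics import NormalDist

def obf_spending(t, alpha=0.05):
    """O'Brien-Fleming-type spending function: cumulative two-sided
    Type I error spent by information fraction t (0 < t <= 1)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / t ** 0.5))

# Four equally spaced looks: very little alpha is spent early,
# and the full alpha = 0.05 is spent by the final analysis.
spent = [obf_spending(t) for t in (0.25, 0.5, 0.75, 1.0)]
```

The design property emphasized in the text is visible directly: the spent error is tiny at the first look (below 10⁻³ here) and increases monotonically to α at full information.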

A variety of “adaptive designs” are recommended by Dr. Biswas. In contrast with classical group sequential designs, which are obtained before the trial begins, adaptive designs use available interim data at each stage to dictate the future course of the trial. Adaptive designs have received considerable attention recently because of their greater flexibility and potentially improved efficiency. In general, there is little argument about the value of adaptive designs for early clinical trials, such as dose determination studies, but there is controversy about using them in later trials, such as Phase III clinical trials. A detailed discussion on adaptive designs is beyond the scope of our paper.

Dr. Biswas proposes an adaptive design that is determined “optimally” at each stage based on the data available before that stage. The optimization pursued in our paper is to generate a design that globally satisfies both Type I and Type II error requirements and minimizes the average sample size in (3.1), with given sample sizes at each stage. One important step is to obtain stopping boundaries at all stages that minimize (3.2) for given b_0, b_1, and b_2 by the method of backward induction. The stopping boundaries are determined in order from the last stage to the first stage. Given the stopping boundaries from the last stage k down to stage i + 1 and a value of the test statistic at stage i, we evaluate the contribution to (3.2) of three actions: reject H_0, accept H_0, and continue to stage i + 1. The test statistic value is then placed in the rejection region of H_0, the acceptance region of H_0, or the continuation region at stage i according to the action with the smallest contribution to (3.2). In contrast, an adaptive procedure is a forward procedure. Without information on the stopping boundaries from stage i + 1 to the last stage k, it is difficult to compute the average sample size after the test statistics from the first stage through stage i are observed. Therefore, our method does not appear to be directly applicable to adaptive designs. However, optimal adaptive designs may be obtainable when the conditional error functions are limited to a family specified by a few parameters.
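The backward induction step can be illustrated with a deliberately simplified toy model. Everything below — one unit-variance Gaussian increment per stage, a two-point 50/50 prior on the effect size, and the particular cost constants b_0, b_1, b_2 — is an assumption made purely for illustration, not the model or the loss function (3.2) of our paper.

```python
import numpy as np

# Toy two-point model: theta = 0 (null) or delta (alternative), prior 1/2 each.
delta, k = 1.0, 3              # effect size per stage, number of stages
b0, b1, b2 = 0.1, 10.0, 10.0   # per-stage sampling cost, Type I cost, Type II cost

s_grid = np.arange(-8.0, 12.0, 0.05)  # grid for the partial-sum statistic S_i
x_grid = np.arange(-5.0, 6.0, 0.05)   # grid for one stage's N(theta, 1) increment

def phi(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def posterior_alt(s, i):
    """P(theta = delta | S_i = s), with S_i ~ N(i * theta, i)."""
    l0 = phi(s / np.sqrt(i))
    l1 = phi((s - i * delta) / np.sqrt(i))
    return l1 / (l0 + l1)

V = None       # value function at stage i + 1 (None past the last stage)
actions = []   # per stage: 0 = reject H_0, 1 = accept H_0, 2 = continue
for i in range(k, 0, -1):          # backward: last stage first
    p1 = posterior_alt(s_grid, i)
    cost = [b1 * (1 - p1),         # reject H_0: penalized if theta = 0
            b2 * p1]               # accept H_0: penalized if theta = delta
    if V is not None:              # continuation is available before stage k
        dx = x_grid[1] - x_grid[0]
        cost.append(np.array([
            b0 + np.sum(((1 - q) * phi(x_grid) + q * phi(x_grid - delta))
                        * np.interp(s + x_grid, s_grid, V) * dx)
            for s, q in zip(s_grid, p1)]))
    cost = np.vstack(cost)
    actions.append(np.argmin(cost, axis=0))
    V = np.min(cost, axis=0)
actions.reverse()  # actions[0] is stage 1, ..., actions[-1] is stage k
```

The qualitative output matches the description above: at the final stage only reject/accept occur, while each earlier stage acquires a continuation region lying between an acceptance region (small S_i) and a rejection region (large S_i).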

5.2. Responses to Comments of Dr. Coad

We sincerely appreciate the discussion of Dr. Coad, whose constructive suggestions identified areas where new statistical methods could extend the capabilities of second-guessing. The following areas were suggested. (1) Second-guessing multitreatment trials would be a useful exercise, to see whether early elimination of one or more inferior treatments could have been accomplished. This could be especially useful in trials that have both active and placebo controls. For further discussion of active vs. placebo controls, see Temple and Ellenberg (2000). (2) An extension to small sample sizes within stages, based on the t-distribution rather than the normal distribution, would be an important contribution. (Our paper requires large-sample, stagewise consistent estimators.) (3) Perhaps more intriguing is the idea of second-guessing a nonsequential or group sequential trial with a group sequential design that allocates patients according to the results of the previous stages, rather than strictly 50–50. This is of special interest for training purposes, where one might or might not demonstrate real advantages of this approach over the conventional 50–50 allocation.

5.3. Responses to Comments of Dr. Liu

We appreciate Dr. Liu's concerns and believe that he has uncovered a new scope for future research. But for several reasons, we are afraid that his basic observations may not be entirely relevant to our paper. First, Dr. Liu utilizes one-sided statistics, whereas we included only two-sided statistics (symmetric in the effect size Δ). Today, because conventional wisdom has often proven to be incorrect, two-sided methods are accepted as standard tools in most areas of clinical trials. This concern stems from the fact that a two-sided method with p = 0.05 and a one-sided method with p = 0.025 are not the same. Hence, we remain unsure about the relevance of Dr. Liu's “working example” in the light of what we have written in our paper. Second, the premise of Dr. Liu's comments rests heavily on preselecting one of the four group sequential options rather than relying upon global optimization. Third, Dr. Liu employs a prior distribution on Δ that differs markedly from ours.

There is a fundamental difference between the interesting approach to second-guessing that Dr. Liu proposes and what we propose. Consider the statement in the review: “Analogous to post-hoc data analysis, by which it would be very likely to find a model or an analysis that produces a desired result, second-guessing selects a design that fits the given data.” To deal with this potential issue, we restricted the competing (second-guess) design choice to a single reference design. Dr. Liu's method allows the choice of design to be a function of the final test statistic (conditional optimization). Some confusion on this issue may stem from our illustrative use of the Pocock and O'Brien-Fleming methods in our two numerical examples. We did so because these were the readily available group sequential designs at the time the studies were actually conducted. Had the studies used group sequential methods, we expect that one of these two approaches probably would have been utilized. None of our work involved design hopping.

It is hard for us to recast Dr. Liu's concerns in light of our proposed paradigm. However, we have no major disagreement with Dr. Liu on one major point: “Unless proper guidance is followed, exercising second-guessing could be imminently detrimental to statistical science.” The second-guessing for a nonsequential trial of the type discussed in our paper asks the single question of what the distribution of stopping times would have been under a single, completely specified reference design. It does not second-guess any actual conclusions of the trial.

Notes

Recommended by N. Mukhopadhyay

REFERENCES

  • Chang, M. N., Hwang, I. K., and Shih, W. J. (1998). Group Sequential Designs Using Both Type I and Type II Error Probability Spending Functions, Communications in Statistics—Theory & Methods 27: 1323–1339.
  • Lan, K. K. G. and DeMets, D. L. (1983). Discrete Sequential Boundaries for Clinical Trials, Biometrika 70: 659–663.
  • Mor, M. K. and Anderson, S. J. (2005). A Bayesian Group Sequential Approach for Multiple Endpoints, Sequential Analysis 24: 45–62.
  • Temple, R. and Ellenberg, S. (2000). Placebo-Controlled Trials and Active-Control Trials in the Evaluation of New Treatments. Part 1: Ethical and Scientific Issues, Annals of Internal Medicine 133: 455–463.
  • Wheatley, K. and Clayton, D. (2003). Be Skeptical about Unexpected Large Apparent Treatment Effects: The Case of an MRC AML 12 Randomization, Controlled Clinical Trials 24: 66–70.
