Sequential Analysis
Design Methods and Applications
Volume 27, 2008 - Issue 1
Original Articles

Editor's Special Invited Paper: Second-Guessing Clinical Trial Designs

Pages 2-20 | Received 29 Jun 2005, Accepted 07 Apr 2007, Published online: 04 Feb 2008

Abstract

This article has two major purposes. First, we propose a methodology that can help biostatistical reviewers of nonsequential randomized clinical trials, armed only with the single summary statistic from the trial, ask the “what if” question “Could a group sequential design have reached a definitive conclusion earlier?” As a side benefit of this research, four-stage group sequential designs that are optimal in some sense are obtained to serve as reference designs for assessing an actual design against a reasonable alternative design with the same power as the original. Four-stage designs were chosen as a practical limit on the number of stages most practitioners would consider, given the inconvenience of interim analysis, including the need to close studies to accrual while the required follow-up information is collected. Since journal editors and the public can subject the trial to close scrutiny after the fact, this new capability could alter the mind-set of some investigators designing clinical trials. Two real examples that were heavily criticized for staying open too long will be presented.


1. INTRODUCTION

In conducting a clinical trial, a peer-reviewed document called the protocol is written before the trial commences, as a manual of procedures. Among other things, it spells out in detail the study objectives, patient recruitment and eligibility, data fields to be collected, timing of medical tests being undertaken, patient follow-up schedule, and methods and precise timing of data analysis.

In fact, Section 4.5 (“Interim Analysis and Early Stopping”) from the International Conference on Harmonization's “Guidance on Statistical Principles for Clinical Trials” (1998) states, “Because the number, methods, and consequences of these comparisons affect the interpretation of the trial, all interim analyses should be carefully planned in advance and described in the protocol.” While it is good science to plan a protocol and run it as planned, the public will often want to know if the answers to the research questions could have been made available earlier, perhaps because a family member or a physician's patient received an inferior treatment in the latter stages of the trial.

This article has two important features. First, provided that journal reports of clinical trials have the appropriate information, the methods of this paper allow the reader to second-guess what would have happened under a different design, and for single-stage designs, to do so using only the published Z-scores (square roots of single degree of freedom chi-squares). For group sequential designs, the Z-scores and information fractions associated with each Z-score are needed. Since reanalysis of the full historical database would not be needed, this methodology will make wide-scale, scientifically sound second-guessing feasible for the first time. Although the methods can be used to superimpose a different sequential design over an actual sequential design, the most important application will be the superimposition of a sequential design over a nonsequential design. The idea of second-guessing trials is not new. Rosner and Tsiatis (1989), armed with the historical databases, reviewed 72 nonsequential studies from the Eastern Cooperative Oncology Group. While none of these studies had a reversal in the general conclusions, several could have been halted early under a sequential plan, although some might have actually extended accrual under certain group sequential methods.

The second purpose of the paper is to provide readers with four-stage group sequential “reference designs” that are optimal under a reasonable risk function, over equally spaced look designs. These reference designs are used to tabulate the second-guess properties of nonsequential designs upon which they are superimposed. They can be viewed as robust, in the sense that they perform well under different optimality conditions, against designs that are optimized for these alternate criteria over a large discrete grid of competing designs.

There may be valid reasons why investigators may be unwilling to adopt a sequential plan for their study: it may be impractical, patient outcome results may come in too late to be of any use for interim analysis, the investigators may distrust proportional hazards in a survival analysis setting, the consequences of good vs. poor results may not affect the health of present and future participants, or the study might be too small to consider sequential designs. Nonetheless, it is of interest to provide tools for readers of medical articles asking the question “What is the likelihood that a logical sequential design would have reached a conclusion at an interim analysis?” Sequential designs generally can shorten the expected duration or reduce the expected number of patients entering the trial. If early termination occurs, not only will potential entrants be spared from receiving inferior therapy (or useless but more toxic therapy), but the results of the trial can get to the public earlier, when a much larger general population can benefit from the results. The sequential design pays a price of less precise estimates of effect sizes than nonsequential trials, but giving convincingly inferior therapy to individuals or delaying publication of important research solely to enhance the precision of an estimate seems to be unethical. However, Souhami (1994) makes an excellent point that if a toxic cancer treatment has a lower limit of a group sequential confidence interval of adding a mere 2% to long-term survival, the general public may shun this treatment, which they might otherwise have utilized had the trial been run as a nonsequential study to its maximum accrual and the confidence interval tightened. A confidence interval that excludes zero may indeed be insufficient evidence to act upon in such a scenario. But this is simply a matter of properly defining the null hypothesis to be the minimum clinically significant improvement needed, rather than a zero difference. In essence, the break-even point could be shifted away from zero to offset the higher expected side effects of one treatment arm.

For practical purposes, clinical trials are rarely conducted under a continuous sequential analysis mechanism. Generally, group sequential methods are used. This entails conducting analysis at discrete times specified in the design. The most notable of group sequential methods were developed by Pocock (1977) and O'Brien and Fleming (1979), and have been used successfully for over a quarter of a century. Important articles by Simon (1991), who cites group sequential methods as a major advance in statistical methods in medicine during the 1980s; Fleming and DeMets (1993), who tie the methods to data and safety monitoring committees; and Ellenberg (2003), who points out that there is wide diversity in both opinion and in practice concerning interim stopping rules, provide excellent motivation that group sequential designs should be used in current trials wherever feasible. Optimization of these designs in terms of average sample size is a daunting problem that has been solved under limited circumstances. Pocock (1982), Geller and Pocock (1987), Chang (1988, 1996), Therneau et al. (1990), Eales and Jennison (1992, 1995), and Barber and Jennison (2002) have produced optimal designs for equally spaced looks and various loss functions. Apart from Chang (1988, 1996), none of these methods have real provisions for early stopping for nonsignificance (futility), something that should be an important part of modern trials. While the Chang designs do have allowance for early acceptance, the optimization was under the alternate hypothesis only. Barber and Jennison (2002) have optimized the sample size under a one-sided alternative, but this does not really accommodate futility, since powerful evidence for accepting the null value is required for early termination, whereas futility requires much less evidence on that side of the ledger. Hence, the earlier work notwithstanding, if there is to be second-guessing, we need to develop new reference designs that accommodate futility, and are optimal in some sense. Due to the futility issue, we opted to tweak the Chang (1988, 1996) methodology to obtain this objective.

In the next section, we shall present the general Brownian motion approach to the sequential progress of a typical clinical trial, along with special cases for the Pocock (1977) and O'Brien and Fleming (1979) methods. In Subsection 2.1, we present numerical examples from “ISIS #2,” the second International Sudden Infarct Study (1988), an actual clinical trial involving over 17,000 patients, where overwhelming statistical significance of the major study questions resulted. Section 3 is devoted to defining optimal four-stage designs that allow for futility. With these particular designs as references, Section 4 and Tables 5A–5C provide the reader with the necessary materials to utilize the final Z-score of a nonsequential trial to make probabilistic statements about how a group sequential design would have fared in terms of early termination of the trial. To be fair, we shall superimpose the design that has the same power as the actual nonsequential design, thereby allowing for the possibility that the study might actually accrue for a longer period of time than the nonsequential design. Section 5 describes how similar methods can second-guess group sequential designs on the basis of other group sequential designs. The final section is devoted to a discussion.

The methodology presented here is strictly for two-sided tests. But the methods are easily adapted for one-sided tests.

2. BROWNIAN MOTION FRAMEWORK FOR SECOND-GUESS METHODOLOGY

Applications that are well approximated by Brownian motion include the one-sample tests for means and proportions. In addition, for two-sample studies where the percentages allocated to each treatment are approximately constant over the trial (e.g., about 50–50), the following are also well approximated by Brownian motion: the Z-test for differences in proportions or means, the Wilcoxon test (see Jones and Whitehead, 1979; Shuster et al., 2004), analysis of covariance for two groups with a random covariate, and, with caution, survival analysis under proportional hazards per Tsiatis (1981). Asymptotically, the nuisance parameter (σ² per Eq. (2.1)) can be replaced by a stage-specific consistent estimator.

The typical model has a summary statistic X(θ), distributed approximately as in (2.1), as implied by Brownian motion:

X(θ) ∼ N(θΔ, θσ²),    (2.1)

where N(μ, γ²) represents the normal distribution with mean μ and variance γ², while θ = 1 represents the fixed sample size of the nonsequential design.

The process also has independent, stationary increments, with

X(θ₂) − X(θ₁) ∼ N((θ₂ − θ₁)Δ, (θ₂ − θ₁)σ²), independent of X(θ₁), for θ₁ < θ₂.    (2.2)

For the two-sided test of the null hypothesis H₀: Δ = Δ₀ = 0 vs. Hₐ: Δ = Δₐ, the fixed sample size (nonsequential) study will reject H₀ if

|X(1)| ≥ Z_{α/2} σ,

where Z_{α/2} is the upper 100[1 − (α/2)] percentile of the standard normal distribution. (Δ₀ = 0 is not a restriction, since otherwise the process can be centered by subtracting θΔ₀ over the entire process.)

The typical group sequential analysis works as follows. For look times θ₁ < θ₂ < ⋯ < θₖ, stop and reject H₀ at the first look j for which

|X(θⱼ)| ≥ Z_R(θⱼ) σ √θⱼ;    (2.3)

stop and accept H₀ (futility) at the first look j for which

|X(θⱼ)| ≤ Z_A(θⱼ) σ √θⱼ;    (2.4)

otherwise continue, that is, proceed to look j + 1 whenever

Z_A(θⱼ) σ √θⱼ < |X(θⱼ)| < Z_R(θⱼ) σ √θⱼ,    (2.5)

with H₀ accepted if the final look θₖ is reached without rejection.

To re-create the statistics at a sequence of interim analyses, we rely on the following result:

Let θ₁ < θ₂ < ⋯ < θₖ, and let

D(θⱼ, θⱼ₊₁) = X(θⱼ) − (θⱼ/θⱼ₊₁) X(θⱼ₊₁), j = 1, 2, …, k − 1.    (2.6)

It follows from the properties of Brownian motion that the D(θⱼ, θⱼ₊₁) are mutually independent random variables, independent of X(θₖ), and that

D(θⱼ, θⱼ₊₁) ∼ N(0, θⱼ(θⱼ₊₁ − θⱼ)σ²/θⱼ₊₁).    (2.7)

Note that from (2.6), we can define the sequence {X(θⱼ), j = 1, 2, …, k − 1} by the backward recurrence equation

X(θⱼ) = (θⱼ/θⱼ₊₁) X(θⱼ₊₁) + D(θⱼ, θⱼ₊₁).    (2.8)

From (2.6)–(2.8), one can probabilistically and retrospectively reconstruct the group sequential behavior of the nonsequential trial whose test statistics follow (2.3)–(2.5). Note in particular that D(θⱼ, θⱼ₊₁) has mean zero whatever the drift Δ, so the reconstruction does not require knowledge of Δ.

Although the requisite integrals could be evaluated numerically, we elected to utilize simulation with 100,000 replicates. The re-creation of the {X(θⱼ)}, with θⱼ < 1, from X(1) works as follows. Generate the D(θⱼ, θⱼ₊₁) from (2.7), and use (2.8). Determine whether the sequence of interim analyses accepts or rejects the null hypothesis, and at what time.

Without loss of generality, we can presume that σ = 1. (If we work with the published Z-statistic as X(1), it follows that σ = 1, since its p-value is obtained from the standard normal distribution.)
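To make this recipe concrete, here is a minimal simulation sketch (Python with NumPy; the function name, interface, and defaults are ours rather than part of any published software). It regenerates the interim statistics from X(1) via (2.7)–(2.8) and then applies cutoffs of the form (2.3)–(2.4) on the Z-scale:

```python
import numpy as np

def second_guess(z_final, looks, z_reject, z_accept=None, n_sims=100_000, seed=1):
    """Stochastically re-create the interim Z-statistics of a nonsequential
    trial whose final Z-score is z_final (i.e., X(1) with sigma = 1), under a
    superimposed group sequential design.  `looks` are information fractions
    relative to the actual trial; all but the last must be < 1, and the design
    stops at the last look regardless.  `z_reject` / `z_accept` are the
    significance / futility cutoffs of (2.3)-(2.4) on the Z-scale.
    Returns the stopping distribution over looks and E(theta | Z)."""
    rng = np.random.default_rng(seed)
    looks = np.asarray(looks, dtype=float)
    interim = looks[:-1]
    assert np.all(interim < 1.0), "interim looks must precede theta = 1"
    k = interim.size
    stop_stage = np.full(n_sims, looks.size)   # default: forced stop at last look
    for s in range(n_sims):
        # Backward recurrence (2.7)-(2.8): bridge back from X(1), drift-free.
        x = np.empty(k)
        x_next, t_next = z_final, 1.0
        for j in range(k - 1, -1, -1):
            t = interim[j]
            mean = (t / t_next) * x_next
            var = t * (t_next - t) / t_next
            x[j] = mean + np.sqrt(var) * rng.standard_normal()
            x_next, t_next = x[j], t
        # Walk forward through the interim analyses, applying (2.3)-(2.4).
        for j in range(k):
            z_j = x[j] / np.sqrt(interim[j])
            futile = z_accept is not None and abs(z_j) <= z_accept[j]
            if abs(z_j) >= z_reject[j] or futile:
                stop_stage[s] = j + 1
                break
    probs = [(stop_stage == j + 1).mean() for j in range(looks.size)]
    return probs, float(np.dot(probs, looks))
```

The inner loop mirrors (2.7)–(2.8) one look at a time for clarity; vectorizing across replicates would speed it up without changing the logic.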

Tables 1A and 1B provide the cutoffs for the classical two-stage and three-stage Pocock (1977) and O'Brien and Fleming (1979) methods, where futility is not considered. The nonsequential competitor accrues to θ = 1, and has the same Type I error and power as the group sequential designs. The group sequential designs need to have a maximum accrual duration θ > 1 to achieve the same operating characteristics as the nonsequential design with θ = 1. The cutoffs can be obtained from Pocock (1977) and O'Brien and Fleming (1979). Although we computed the ceiling times directly from (2.2)–(2.4), they can also be obtained from group sequential statistical packages such as East, per Senchaudhuri et al. (2005); a brute-force simulation sketch follows Tables 1A and 1B below. To design a group sequential study from Table 1A or 1B, compute the sample size for the nonsequential study, set it as θ = 1, and prorate to the look times in the table.

Table 1A. Classical study designs with two total looks, look times θⱼ as fractions of the nonsequential study, and no provision for futility (Z_A(θⱼ) = 0)

Table 1B. Classical study designs with three total looks, look times θⱼ as fractions of the nonsequential study, and no provision for futility (Z_A(θⱼ) = 0)
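As an alternative to direct computation, the cutoffs and ceiling times of Tables 1A and 1B can be approximated by brute-force simulation of (2.1)–(2.2). The sketch below (ours; the bracketing intervals and tolerance are illustrative) finds the Pocock constant for k equally spaced looks by bisection on the simulated Type I error, which depends only on the ratios of the look times and hence not on θMAX, and then finds the ceiling θMAX that restores the targeted power:

```python
import numpy as np

def cross_prob(c, theta_max, delta, k=3, n_sims=200_000, seed=3):
    """Simulated probability that some |Z_j| >= c for k equally spaced looks
    at theta_max * j / k, when the Brownian motion (2.1) has drift delta
    (sigma = 1)."""
    rng = np.random.default_rng(seed)
    looks = theta_max * np.arange(1, k + 1) / k
    dt = np.diff(np.concatenate(([0.0], looks)))
    x = np.cumsum(delta * dt + np.sqrt(dt) * rng.standard_normal((n_sims, k)), axis=1)
    return float((np.abs(x / np.sqrt(looks)) >= c).any(axis=1).mean())

def bisect(f, lo, hi, tol=1e-3):
    """Root of a decreasing function f with f(lo) > 0 > f(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

alpha, power = 0.01, 0.90
delta_a = 2.5758 + 1.2816   # drift giving 90% power at two-sided p = 0.01, theta = 1
# Pocock constant: the Type I error depends only on the look-time ratios j/k.
c = bisect(lambda v: cross_prob(v, 1.0, 0.0) - alpha, 2.0, 4.0)
# Ceiling: smallest theta_max restoring the power of the nonsequential design.
theta_max = bisect(lambda t: power - cross_prob(c, t, delta_a), 1.0, 2.0)
print(c, theta_max)         # c should land near the 2.87 quoted in Section 6
```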

2.1. Illustration for the ISIS #2 Trial

The second International Sudden Infarct Study (1987, 1988) was a double-blind two-by-two factorial study pitting aspirin vs. placebo and streptokinase vs. placebo. The summary results for five-week mortality are presented in Table 2. This study had a data and safety monitoring committee that met every six months, but apparently had no formal stopping rules for efficacy. The trial accrued patients for three years, and at about two years into the randomization, they did find one patient subset with a perceived superiority of streptokinase over placebo. From that time forward, each physician in this multicenter trial was given the option to exclude patients in that subset. Further, although five-week mortality was a major endpoint, there were two others. Hence, our analysis in Table 3 should be viewed more as a conservative assessment. It is presumed that the study will be analyzed according to the designed operating characteristics, and that if a major endpoint crosses a stopping barrier, the study would be flagged for review, and continued only under the most extreme conditions. One might therefore view the other endpoints as affecting early termination in the sense that the study might close even earlier if one of the other endpoints flags the study before the five-week mortality endpoint does. The argument of Souhami (1994), that the lower limit of the confidence interval needs to be clinically important to gain widespread acceptance of the treatment, needs to be taken into account when these calculations are interpreted. However, the study planners indeed used a null hypothesized difference of zero.

Table 2. Results of ISIS #2 (five-week mortality), where entries are deaths/number of patients

Suppose this study indeed had no sequential monitoring plan, so that the Brownian motion approximation can be used. Table 3 provides the stopping distributions for the O'Brien and Fleming (1979) and Pocock (1977) designs, each with three equally spaced looks, using a two-sided p-value of 0.01 and 90% power, the operating characteristics used by the original study planners. As seen in Table 3, for the Pocock superimposed design, the mean sample size to reach a conclusion, given the actual result, would have been 0.504(17,187) = 8,662 patients.

Table 3. Second-guessing for ISIS #2 trial (1988)

This analysis demonstrates that it would have been overwhelmingly unlikely that every interim analysis before completion of the planned accrual would have failed to pick up a significant difference. An interim result might have led to a decision to (a) drop double placebo in favor of an additivity question; (b) drop one of the placebos, aspirin, or streptokinase; or (c) halt the study entirely.
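As a concrete illustration, the second_guess sketch from Section 2 can approximate a row of Table 3. The Pocock Z-cutoff of 2.87 for three looks at two-sided p = 0.01 is quoted in Section 6; the ceiling θMAX = 1.17 below is an assumed stand-in for the corresponding Table 1B entry (it can be recomputed with the simulation sketch following Tables 1A and 1B):

```python
# Three-look Pocock design (two-sided alpha = 0.01, power 0.90) superimposed
# on the ISIS #2 aspirin-vs-placebo comparison, final Z = 5.23.
theta_max = 1.17                                  # assumed ceiling (see Table 1B)
looks = [theta_max / 3, 2 * theta_max / 3, theta_max]
probs, e_theta = second_guess(5.23, looks, z_reject=[2.87] * 3)
print(probs, e_theta)   # E(theta | Z) should land near the 0.50 reported in Section 4
```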

3. FOUR-STAGE REFERENCE DESIGNS

In order to provide good reference designs, it seems reasonable to consider four-stage designs as a pragmatic maximum number of stages one might consider. For example, if the outcome takes four weeks to collect, one has to close the study to accrual for up to three separate four-week periods between stages to ensure that complete data are available to do the interim analysis. Reference two-stage designs can be found in Shuster et al. (2002), and a reference three-stage design can be found in Shuster et al. (2004).

Two approaches were used. The first employed the method of Chang (1988, 1996) for equally spaced looks, with two modifications. First, we increased the maximum allowable sample size in the published table from 105% of the single stage to 133%. A ceiling larger than 133% would prevent the result at the actual end of the nonsequential trial from definitively determining whether three or four looks would be undertaken, given no closure by the second look. Second, we altered the loss function from the expected sample size under the alternative hypothesis to the average of the expected sample sizes under the null and alternate hypotheses:

E{θ} = ½E(θ | H₀) + ½E(θ | Hₐ).    (3.1)

If futility is important, it is vital to penalize a design with poor performance under the null hypothesis.

3.1. Global Optimization

Based on the ideas of Kiefer and Weiss (1957), Chang (1988, 1996) developed a method to find optimal designs. A group sequential testing procedure is specified by sample size and stopping boundaries at each of four stages. Let D denote the set of all possible group sequential testing procedures, including all randomized procedures as well, where randomized refers to a random choice of decisions rather than random assignment of patients to treatment (Lehmann, 1997). A convex combination ∑ aᵢDᵢ with aᵢ > 0, ∑ aᵢ = 1, and Dᵢ ∈ D defines a randomized design that uses Dᵢ with probability aᵢ. Via this mathematical artifice, any convex combination of a finite number of procedures in D still belongs to D.

Each of these procedures produces a design point ∂ = (α, β, E{θ}), where α and β are the Type I and Type II errors, respectively, and E{θ} is given in (3.1). A testing procedure is optimal if there does not exist a competing procedure that produces a dominating design point ∂′ = (α′, β′, E{θ}′) such that α′ ≤ α, β′ ≤ β, and E{θ}′ < E{θ}. The set consisting of all design points forms a convex set Ω of ∂ values in R³. Hence, the lower surface Γ of Ω, consisting of all optimal design points, is convex. Let ∂ = (α, β, E{θ}) be an optimal design point, and ∂′ = (α′, β′, E{θ}′) be any point (∂′ ≠ ∂) on the line segment from the origin to ∂. Because ∂′ = c∂ for some 0 < c < 1, so that α′ < α, β′ < β, and E{θ}′ < E{θ}, ∂′ cannot be a design point (it would dominate the optimal ∂). So the optimal design surface Γ is visible from the origin. For any vector B = (b₀, b₁, b₂) with bᵢ > 0 and ∑ bᵢ = 1, there exists a plane tangent to Γ with B as the normal vector. The intersection point ∂(B) of the tangent plane and Γ is called the optimum design point associated with the vector B. To find ∂(B) for a given vector B, we need to find the design that minimizes the inner product

B · ∂ = b₀α + b₁β + b₂E{θ}.    (3.2)

The optimal design can be characterized as a Bayes solution. Let Θ = {−Δₐ, 0, Δₐ} be the parameter space and a₀ and a₁ be the decisions of accepting H₀ and rejecting H₀, respectively. The loss function is denoted by L(μ; ·), with

L(0; a₁) = £₂, L(−Δₐ; a₀) = L(Δₐ; a₀) = £₁, and L(μ; ·) = 0 otherwise.

The cost of taking a single observation is denoted as c(μ), which depends on the parameter value μ. The risk of a testing procedure is defined as the sum of the expected cost and the expected loss. Let (Π₋ₐ, Π₀, Πₐ) be the prior distribution over the parameter space Θ = {−Δₐ, 0, Δₐ}. The risk can be expressed as

r = Π₀ £₂ α + (Π₋ₐ + Πₐ) £₁ β + ∑_{μ ∈ Θ} Π_μ c(μ) E_μ{θ}.    (3.3)

If Π₀ = 1/2, Π₋ₐ = Πₐ = 1/4, £₁ = b₁ and £₂ = b₀, and c(−Δₐ) = c(Δₐ) = c(0) = b₂/2, then (3.3) is equal to 0.5 times (3.2): the loss terms become ½b₀α + ½b₁β, and, by the symmetry E₋Δₐ{θ} = EΔₐ{θ}, the cost term becomes (b₂/2)[½E₀{θ} + ½EΔₐ{θ}] = ½b₂E{θ}.

Thus the design that minimizes (3.2) can be characterized as the Bayes solution with the loss function, cost function, and prior distribution as defined above.

The optimal design associated with vector B can be found by the method of backward induction. In Section 4 of Chang (1996), a numerical method is provided to find the stopping boundaries that minimize (3.2) for given sample sizes at all stages. The only modification is in the last term in braces of the expression of C_m on p. 368, to account for a different risk function: Chang (1996) minimized the average sample size at the alternate hypothesis, whereas in this article, we minimize the average of the expected sample sizes under the null and alternate hypothesized values.

Note that the optimal design associated with vector B may not satisfy the Type I and Type II error probability constraints. A numerical procedure developed in Section 5 of Chang (1996), directly applicable to the requirements of our methods, was then utilized to search over {B} for the optimal design meeting these constraints.
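Chang's backward-induction solution is not reproduced here, but the quantity being minimized is easy to estimate for any candidate design by forward simulation. The following sketch (ours, assuming NumPy; the candidate boundaries, weights, and drift shown are hypothetical) estimates the design point ∂ = (α, β, E{θ}), with E{θ} per (3.1), and evaluates the inner product (3.2); the grid search of Subsection 3.2 amounts to minimizing such estimates over a family of candidates:

```python
import numpy as np

def design_point(looks, z_reject, z_accept, delta_a, n_sims=100_000, seed=2):
    """Estimate (alpha, beta, E{theta}) for a candidate design by forward
    simulation of (2.1)-(2.2) with sigma = 1, under Delta = 0 and
    Delta = delta_a; E{theta} is the equally weighted average (3.1)."""
    rng = np.random.default_rng(seed)
    looks = np.asarray(looks, dtype=float)
    dt = np.diff(np.concatenate(([0.0], looks)))
    summary = {}
    for name, drift in (("null", 0.0), ("alt", delta_a)):
        incr = drift * dt + np.sqrt(dt) * rng.standard_normal((n_sims, looks.size))
        z = np.cumsum(incr, axis=1) / np.sqrt(looks)
        rejected = np.zeros(n_sims, dtype=bool)
        stopped = np.zeros(n_sims, dtype=bool)
        stop_time = np.full(n_sims, looks[-1])
        for j in range(looks.size):
            hit_r = ~stopped & (np.abs(z[:, j]) >= z_reject[j])
            hit_a = ~stopped & (np.abs(z[:, j]) <= z_accept[j])  # Z_A = 0: no futility
            rejected |= hit_r
            stop_time[hit_r | hit_a] = looks[j]
            stopped |= hit_r | hit_a
        summary[name] = (rejected.mean(), stop_time.mean())
    alpha, e0 = summary["null"]
    power, ea = summary["alt"]
    return alpha, 1.0 - power, 0.5 * (e0 + ea)

# Hypothetical four-look candidate and weight vector B; print B . design point, per (3.2).
alpha, beta, e_theta = design_point(
    looks=[1 / 3, 2 / 3, 1.0, 4 / 3],
    z_reject=[2.6, 2.5, 2.4, 2.1], z_accept=[0.0, 0.5, 1.0, 2.1],
    delta_a=1.96 + 0.8416)  # drift giving 80% power at two-sided alpha = 0.05
b = (0.25, 0.25, 0.50)
print(b[0] * alpha + b[1] * beta + b[2] * e_theta)
```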

The designs are presented in Table 4, along with their properties.

Table 4. Four-stage reference designs

3.2. Optimization Over a Grid

Secondarily, we optimized the design over a rich array of alternatives, constrained as seen below. The purpose of this secondary optimization is to assess the robustness of the optimal designs when other reasonable loss functions (e.g., the expected sample size under the null hypothesis, the expected sample size under the alternate hypothesis, the larger of the expected sample size under the null hypothesis and alternate hypothesis, or the global maximum of the expected sample size over all values of the parameter of interest) are considered.

The definition of the grid search is as follows:

  1. Intervals between looks were made at equal increments in information time, namely 25% of θMAX, where θMAX is the time of the final planned look if the study reaches that point.

  2. The cutoff for significance in (2.3) is defined in four modes corresponding to λ ∈ {0, 1/10, 1/6, 1/2}, with Z_R(θⱼ) proportional to (θⱼ/θMAX)^(−λ) and the constant of proportionality chosen to achieve the targeted Type I error.

    λ = 0 represents the Pocock (1977) cutoff and λ = 1/2 represents the O'Brien and Fleming (1979) cutoff. The cutoffs for significance are constant for the Pocock (1977) method and decrease at varying rates with increasing information time for the other values of λ.

  3. The cutoff for futility in (2.4) was studied over a grid of cutoff sequences Z_A(θⱼ), selected on the premise that it would be a good strategy to require each futility cutoff to increase with increasing information.

  4. Two-sided Type I error α and power π are in one of the following combinations: (α = 0.05, π = 0.80), (α = 0.05, π = 0.90), and (α = 0.01, π = 0.90).

  5. The maximum look time θMAX cannot exceed 4/3 of the look time required for a single-stage study to complete its accrual. That is, only one look can occur after a fixed-sample-size study of the same Type I error (two-sided) and power would have been completed.

The properties of the best performers, that is, the designs in the grid with the lowest expected losses under the loss functions specified, are provided in Table 4 in parentheses. No single design was the universal grid champion for all loss functions. These properties were based on 100,000 matched simulations, each loss function evaluated on the same data per simulation. These results demonstrate a robustness property of the globally optimized designs for the loss function (3.1) under alternate loss functions.

For the null and alternate hypothesis, respectively, E(θ|H₀) and E(θ|Hₐ) represent the mean accrual fraction (sample size, or failures for proportional hazards models) that the design requires relative to the single-stage design with the same operating characteristics. Note that the reference design is either better or at worst slightly inferior to the grid search champion for each optimality criterion.

To use a reference design, one simply does a nonsequential sample-size calculation (or, in the case of survival analysis under proportional hazards, an expected-failure calculation). Set this requirement as θ = 1, and apply the look fractions and cutoffs from Table 4. For example, if a design with power 80% and Type I error 5% required 100 expected deaths, the look times for reference design I would be after 34, 67, 100, and 133 expected deaths.

A somewhat surprising observation is that for designs II and III, the cutoff for significance is not monotonic with look times, but look 3 has a higher cutoff than look 2. We have no explanation for this apparent aberration.

4. SUPERIMPOSITION OF A GROUP SEQUENTIAL DESIGN OVER A NONSEQUENTIAL DESIGN

In Tables 5A, 5B, and 5C, representing reference designs I, II, and III of Table 4, respectively, we present the stopping time distribution of the design, conditional on an observed Z-statistic in an equally powered nonsequential study. The conditional mean sample size is also given. Since the designs have their third look at θ = 0.9975, it is not surprising that for most of the table, the probability of stopping at look 3, p₃, is either zero or a very substantial value for these designs. Although the table is rather coarse in Z, interpolations of the conditional means and the probabilities of stopping at looks 1 and 2, p₁ and p₂, are sufficiently accurate for practical purposes. For all of the designs, it is possible that the conditional mean exceeds 1, implying that for some outcomes, the reference designs are actually less efficient than the nonsequential designs. These anomalies occur where the significance of the single-stage design is borderline (Z near the stage 3 cutoff).

Table 5A. Second-guessing nonsequential designs by reference design I, Table 4 (Type I error = 0.05, power = 0.80), Z = observed value of final single-stage statistic, where p₁, p₂, p₃, and p₄ are the probabilities of stopping at stages 1, 2, 3, and 4, respectively, and E(θ|Z) is the average fraction of information in the group sequential trial relative to the actual trial, given the actual Z-value

Table 5B. Second-guessing nonsequential designs by reference design II, Table 4 (Type I error = 0.05, power = 0.90), Z = observed value of final single-stage statistic, where p₁, p₂, p₃, and p₄ are the probabilities of stopping at stages 1, 2, 3, and 4, respectively, and E(θ|Z) is the average fraction of information in the group sequential trial relative to the actual trial, given the actual Z-value

Table 5C. Second-guessing nonsequential designs by reference design III, Table 4 (Type I error = 0.01, power = 0.90), Z = observed value of final single-stage statistic, where p₁, p₂, p₃, and p₄ are the probabilities of stopping at stages 1, 2, 3, and 4, respectively, and E(θ|Z) is the average fraction of information in the group sequential trial relative to the actual trial, given the actual Z-value

It is of interest to revisit the ISIS #2 trial (1988), and compare reference design III of Table 4 to the results in Table 3. For the double placebo vs. double drug comparison, where Z = 7.85, the reference design would stop at stage 1 (stage 2) with probability 96.9% (3.1%) for a mean of 0.34. (The three-stage Pocock design with no futility component had a mean of 0.38.) The results for Z = 5.23 and Z = 5.90 had averages of 0.51 and 0.44 for the four-stage reference designs vs. 0.50 and 0.44 for the three-stage Pocock design (1977). While the reference design has one more stage, it also had to accommodate the provision for futility. Hence, the similarity of the means is not surprising.

5. SUPERIMPOSITION OF ONE GROUP SEQUENTIAL DESIGN OVER ANOTHER

If the authors of a published group sequential design provide the sequence of look times, the sequence of Z-statistics at those looks, the planned total information time, the power, and the Type I error, then indeed we can stochastically superimpose a reference design up to the point where the trial was halted.

There are two key results that are needed to obtain the necessary probabilistic properties.

Result 5.1

If X(θ) is Brownian motion per (2.1), and θ₁ < θ₂ < θ₃, then the conditional distribution of X(θ₂) given X(θ₁) and X(θ₃) is normal, N(A, B), where the mean A = X(θ₁) + [(θ₂ − θ₁){X(θ₃) − X(θ₁)}/(θ₃ − θ₁)] and the variance B = σ²(θ₂ − θ₁)(θ₃ − θ₂)/(θ₃ − θ₁).

This follows from a change of variables to the mutually independent random variables (from the independent increment property of Brownian motion)

X(θ₁), X(θ₂) − X(θ₁), X(θ₃) − X(θ₂).

The joint distribution is divided by the marginal distribution.

Result 5.2

Consider a Markov process {Vⱼ}, such as Brownian motion, with random variables Y₁, Y₂, Y₃, and Y₄, an arbitrary subset of {Vⱼ}, having continuous joint density f(x₁, x₂, x₃, x₄), with increasing subscripts representing increasing times. We shall denote by f_{R|S} the conditional density of the variables with subscripts in R given the variables with subscripts in S. If S is the empty set, we denote by f_R the marginal density of the variables in R.

Claim: f_{23|14} = f_{2|13} f_{3|14}.

Proof

A key to the result is the Markovian property f_{4|123} = f_{4|3} = f_{4|13} (the probability distribution of the future given a past history depends only upon the most recent event in this history):

f_{23|14} = f_{234|1}/f_{4|1} = f_{3|1} f_{2|13} f_{4|123}/f_{4|1} = f_{2|13} f_{3|1} f_{4|13}/f_{4|1} = f_{2|13} f_{34|1}/f_{4|1} = f_{2|13} f_{3|14}.

In the same way, applying induction to the above result, it can be shown that if {Xᵢ} is an (N + 1)-dimensional subset of the Markov process {Vⱼ}, with continuous joint density, and i = 1, 2, 3, …, N, N + 1 representing increasing time, then

f_{2,…,N | 1,N+1} = ∏_{i=2}^{N} f_{i | 1,i+1}.
If the superimposed design has one or more looks between two actual looks, we can use Result 5.1 to generate the last look in the interval. If a second look is needed, we partition it by Result 5.2 and use Result 5.1 to generate the next-to-last look, etc. Specifically, suppose that actual looks occurred at information fractions γ₁ and γ₂ and the superimposed design had four looks at θ₁ < γ₁ < θ₂ < θ₃ < θ₄ < γ₂. Conditionally on X(γ₁) and X(γ₂), the random vectors X(θ₁) and [X(θ₂), X(θ₃), X(θ₄)] are independent.
From the generalized Results 5.2 and 5.1, we can generate X(θ₄) given X(γ₁) and X(γ₂), X(θ₃) given X(θ₄) and X(γ₁), and then X(θ₂) given X(θ₃) and X(γ₁), as the appropriate normally distributed sequence per Result 5.1. Note that the simulated X(θ) values for θ < γ₂ do not involve the drift parameter Δ.
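A minimal sketch of this backward filling step (ours, assuming NumPy; the function name and the numerical example are hypothetical), holding the left endpoint fixed at the earlier actual look per Results 5.1 and 5.2:

```python
import numpy as np

def fill_bridge(x_left, x_right, t_left, t_right, new_times, rng):
    """Generate X(theta) at new_times (t_left < theta_1 < ... < theta_m < t_right)
    given X(t_left) = x_left and X(t_right) = x_right: draw the last new look
    first given the two endpoints, then work backward, each draw conditioned
    on (x_left, the previously drawn value), per Results 5.1 and 5.2."""
    draws = np.empty(len(new_times))
    x_r, t_r = x_right, t_right
    for i in range(len(new_times) - 1, -1, -1):
        t = new_times[i]
        mean = x_left + (t - t_left) * (x_r - x_left) / (t_r - t_left)
        var = (t - t_left) * (t_r - t) / (t_r - t_left)   # Result 5.1, sigma = 1
        draws[i] = mean + np.sqrt(var) * rng.standard_normal()
        x_r, t_r = draws[i], t
    return draws

# Hypothetical example: three superimposed looks between actual looks at
# information fractions 0.3 (X = 1.1) and 0.8 (X = 2.0); no drift is needed.
rng = np.random.default_rng(0)
print(fill_bridge(1.1, 2.0, 0.3, 0.8, [0.45, 0.60, 0.75], rng))
```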

6. DISCUSSION

Interest in the ability to second-guess researchers and data and safety monitoring committees has increased in recent years. For example, Lilford et al. (2001), in an editorial on monitoring clinical trials, make the statement, “Withholding, without debate and endorsement of the policy, information that patients might find useful is at best paternalistic, at worst authoritarian and arguably unnecessary.” Reidpath (2001) goes a step further and suggests that interim data be made available, not just the results of interim analyses. The National Institutes of Health have adopted the policy (effective October 2003) that data accruing from grants over $500,000 direct costs in any year be made public as soon as an article is accepted for publication. Presumably, if a sequential method were employed, the interim data must also be included. Our own view is that indeed the interim monitoring should be entrusted to a data and safety monitoring committee, and should be kept confidential from the general public until the study is accepted by a peer-reviewed medical journal. The committee balances patient advocacy considerations with the efficacy and safety issues. But the ability to second-guess after the fact will put these committees on notice, since their interim decisions can be scrutinized once the study is published. Since it is very tedious for an individual who might question the conduct of a trial to acquire and reanalyze the data, the methods proposed here can provide a good screening step as to whether or not to request the full data set.

These methods should prove very useful to reviewers of manuscripts reporting results of clinical trials, in terms of what information needs to be reported. If a paper utilized a group sequential design, then at a minimum, the design, including the plans for interim monitoring, should be fully disclosed. In addition, the results of all interim analyses should be disclosed, even if these interim analyses did not reach a stopping barrier. This disclosure also helps readers determine if a stopping barrier was reached, but overruled. The sequence of Z-statistics and information fractions should be part of the report. If unplanned interim analyses were performed, or if a nonsequential study's report was based on an accrual substantially different from the planned accrual, this needs to be explained in the paper. If an editor or reviewer sees these details lacking, then the authors should be asked to supply them (or provide a good reason why they cannot be supplied) in order to get the paper accepted. If the publication fails to provide this information, the ultimate readers will not have the tools necessary to construct a second-guess analysis to assess an alternate design. It is of note that the Pharmaceuticals Manufacturing Association's Biostatistical and Medical Ad Hoc Committee on Interim Analysis (1993) included the following recommendations as standard operating procedures: (a) document plans for interim analysis in trial protocols, and (b) document interim analyses of trials and their consequences on trial design and conduct. The above recommendations for publication are in agreement with what this committee considered to be good scientific practices.

It also needs to be pointed out that the quantification of the gains of a group sequential design over a nonsequential design may be slightly overstated in terms of calendar time. Although the probabilistic details are rigorous, a real group sequential trial might have required a time-out after each stage, so that the information necessary to decide whether or not to proceed to the next stage could be accrued. The nonsequential design does not have these gaps. Hence, the expected calendar time advantage of the group sequential design would be less than it would appear from an accrual vantage point. Good practice for the group sequential trial is to plan carefully as the time of an analysis approaches. If one can get the most up-to-date information together as the potential time-out for interim analysis approaches, one could obtain X(θ) for θ close to the time of the interim analysis, and use the Brownian motion property under both the null and alternate hypotheses to assess the probability of continuation. If, under both scenarios, that probability is high (say, above 80%), then accrual could continue until the real interim analysis is complete. If one of these probabilities of continuation is below this threshold, a temporary stop would be made. Fortunately, with the availability of the Internet, rapid acquisition of data is much more feasible today than it was a decade ago. Hence, we believe it is feasible to meet the objectives of timely data acquisition as an interim analysis approaches.

A valid critique of the second-guess methodology is the following. If indeed you had a time-out after the first stage, then this could affect everything thereafter. A few patients in the second and subsequent stages (if applicable) would never have entered the trial. There is no conceptual problem with the assumption that the superimposed trial randomizes the nonexcluded participants in exactly the same manner as the actual trial. The second-guess methodology of this paper is fully rigorous if continuous accrual (i.e., without time-outs) can be maintained. It should be viewed as a reasonable approximation when the time-outs are of small duration in terms of patients lost relative to the total planned accrual of the trial.

Trials involving survival are especially difficult to monitor. There are two enormous problems that should be red flags. First, the information fractions tend to accumulate at a far slower pace than the accrual fraction. Second, if an experimental arm involving long-term medication shows an early advantage, one worries about the potential for the advantage to disappear or even turn in the opposite direction as tolerance to the medication develops or cumulative toxicity occurs. One such trial, in which simvastatin (Zocor) was compared to a placebo to prevent death and serious cardiac events in high-risk subjects (Heart Protection Study Collaborative Group, 2002), was heavily criticized by Migrino and Topol (2003) for staying open too long. The study, which began its accrual in July 1994, was designed to accrue 20,000 subjects over a three-year period with a minimum follow-up of four years (a seven-year study), with 3,000 total expected deaths and a power of 90% at p = 0.01 (two-sided). Although there was no mention of a formal monitoring plan, a data monitoring committee took annual looks at the data, based mainly on total mortality and secondarily on cardiac mortality. Assuming that a three-stage group sequential design is to be superimposed, the calendar times of three looks at 33%, 67%, and 100% information, assuming uniform accrual and exponential survival, would be approximately 3.2, 5.0, and 7.0 years (Shuster, 1992). The first look would occur two months after termination of accrual. Migrino and Topol (2003) state that the study should have been reported at a minimum by July 1999 (rather than the actual cutoff of October 2001). All patients would have had two to five years of follow-up (none followed more than the planned five-year endpoint). The final result saw total mortality of 1,328/10,269 (drug) vs. 1,507/10,267 (placebo). Although the data for a logrank test were not available, the Z-test for proportions (3.62) will be used as an illustration. The information at calendar time July 1999 (5.0 years after initiation of accrual) is 66%, which would roughly correspond to the second look. From (2.8), it follows that the predictive distribution of the Z-value in July 1999 follows a normal distribution with a mean of 2.96 and a standard deviation of 0.58. In fact, to achieve a significant difference at the second of three stages for the O'Brien–Fleming test, a Z-value of over 3.18 would have been required per Table 1B (90% power at p = 0.01, two-sided, with no curtailment for futility). Given the final Z-value, the probability that the O'Brien–Fleming second look was significant was 36%. The more aggressive Pocock (1977) test would have required a Z-value above 2.87 (Table 1B) at that point, a 56% likelihood. Therefore, it does not seem to be a reasonable criticism that the data were overwhelming at that time (July 1999).
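These predictive figures follow directly from (2.8): given X(1) = 3.62 at information fraction t, Z(t) = X(t)/√t is normal with mean √t · 3.62 and standard deviation √(1 − t). A quick check (our own sketch, Python standard library only; the small differences from the figures above reflect rounding of the 66% information fraction):

```python
from math import sqrt, erf

def norm_sf(x, mu, sigma):
    """Upper-tail probability of N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 - erf((x - mu) / (sigma * sqrt(2.0))))

z_final, t = 3.62, 0.66   # final Z-test for proportions; information in July 1999
# Given X(1) = z_final, X(t) ~ N(t*z_final, t*(1 - t)), so Z(t) = X(t)/sqrt(t)
# ~ N(sqrt(t)*z_final, 1 - t).
mean, sd = sqrt(t) * z_final, sqrt(1.0 - t)
print(round(mean, 2), round(sd, 2))        # about 2.94 and 0.58 (paper: 2.96, 0.58)
print(round(norm_sf(3.18, mean, sd), 2))   # O'Brien-Fleming look 2: about 0.34 (paper: 36%)
print(round(norm_sf(2.87, mean, sd), 2))   # Pocock look 2: about 0.55 (paper: 56%)
```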

When superimposing one group sequential trial over another, it is of interest to note that the only methodological issues are whether it is valid to consider the bridges between the actual looks as independent Brownian bridges, as described in Section 5, and whether or not a stopping barrier was reached but overruled. As long as all look times and their corresponding Z-statistics are disclosed, whether planned or unplanned, and no stopping barrier was passed prior to the final analysis, the methods of Section 5 remain valid for a trial whose unconditionally obtained information accrues as approximate Brownian motion. If a stopping barrier was indeed overruled, the second-guess methodology described herein would be moot; the natural question would instead be the second-guessing of that overrule.

This paper presents a unique application of the properties of a sufficient statistic. The traditional asset of sufficient statistics is that one can ignore the rest of the data to make an inference about the parameters of interest. In the present case, we turn this around. The sufficiency of X(θ) for the parameter Δ over the interval [0, θ] allows us to probabilistically reconstruct the rest of the data prior to time θ without the need to know Δ.

While group sequential methods are perhaps the most common methods for interim analysis, there are others that are commonly used, and the methods of this paper are not useful for second-guessing those types of designs. As Enas et al. (1989) point out, other methods include the sequential probability ratio test (continuous interim analysis), a Bayesian approach, an empirical Bayesian approach, stochastic curtailment, and predictive distributions (an analytic relative of stochastic curtailment). That article also contains some practical advice for designing efficient clinical trials.

6.1. Summary

This paper attempts to make four public health contributions. First, it will provide readers and reviewers of medical articles with a new tool to look critically at reports of clinical trials to assess the design used in the study. Second, since designs of trials could now be under increased scrutiny, planners of trials will be more motivated toward improving the efficiency of their designs. Third, the reference designs provided in this paper will prove useful in planning efficient trials. Finally, it is hoped that this paper will act as a catalyst for obtaining additional reference designs that are efficient in some sense, including those with small sample sizes, those with multivariate endpoints, and those accommodating unequal intervals between looks, so that this second-guess capability can be extended to other types of trials.

ACKNOWLEDGMENTS

This work was partially supported by Grants M01 RR00082 from the National Center for Research Resources and U10 CA98413 from the National Cancer Institute, National Institutes of Health.

Notes

See (2.3)–(2.5) to clarify notation. Note that the Pocock approach has a higher upper limit for sample size than the O'Brien–Fleming approach.


Z (pooled-variance) for double drugs vs. double placebo: 7.85

Z (Mantel–Haenszel) aspirin vs. placebo: 5.23

Z (Mantel–Haenszel) streptokinase vs. placebo: 5.90.

p(θ) = Probability of stopping for significance at time θ, where θ values are per Table 1B for Type I error 0.01 and power 0.90. E(θ|Z) = Average stopping time as a fraction of actual stopping time.

∗ θ is the time (sample size) relative to θ = 1, the required time for a single-stage study to have Type I error and power as indicated in the first two lines of the table.

Recommended by Uttam Bandyopadhyay

REFERENCES

  • Barber, S. and Jennison, C. (2002). Optimal Asymmetric One-Sided Group Sequential Tests, Biometrika 89: 49–60.
  • Chang, M. N. (1988). Optimal Designs for Group Sequential Clinical Trials, Technical Report 295, Department of Statistics, Gainesville: University of Florida.
  • Chang, M. N. (1996). Optimal Designs for Group Sequential Clinical Trials, Communications in Statistics—Theory & Methods 25: 361–379.
  • Eales, J. D. and Jennison, C. (1992). An Improved Method for Deriving Optimal One-Sided Group Sequential Tests, Biometrika 79: 13–24.
  • Eales, J. D. and Jennison, C. (1995). Optimal Two-Sided Group Sequential Tests, Sequential Analysis 14: 273–286.
  • Ellenberg, S. S. (2003). Are All Monitoring Boundaries Equally Ethical? Controlled Clinical Trials 24: 585–588.
  • Enas, G. G., Dornseif, B. E., Sampson, C. B., Rockhold, F. W., and Wuu, J. (1989). Monitoring vs. Interim Analysis of Clinical Trials: A Perspective from the Pharmaceutical Industry, Controlled Clinical Trials 10: 57–70.
  • Fleming, T. R. and DeMets, D. L. (1993). Monitoring of Clinical Trials: Issues and Recommendations, Controlled Clinical Trials 14: 183–197.
  • Geller, N. L. and Pocock, S. J. (1987). Interim Analyses in Randomized Clinical Trials: Ramifications and Guidelines for Practitioners, Biometrics 43: 213–223.
  • Heart Protection Study Collaborative Group. (2002). MRC/BHF Heart Protection Study of Cholesterol-Lowering with Simvastatin in 20,536 High-Risk Individuals: A Randomised Placebo-Controlled Trial, Lancet 360: 7–22.
  • International Conference on Harmonization. (1998). Guidance on Statistical Principles for Clinical Trials, Federal Register 63(179): 49583–49598.
  • International Sudden Infarct Study #2 Steering Committee. (1987). Intravenous Streptokinase within 0–4 Hours of Onset of Myocardial Infarction Reduced Mortality, Lancet 1: 502.
  • International Sudden Infarct Study #2 Collaborative Group. (1988). Randomised Trial of Intravenous Streptokinase, Oral Aspirin, Both, or Neither among 17,187 Cases of Acute Myocardial Infarction: ISIS 2, Lancet 2: 349–360.
  • Jones, D. and Whitehead, J. (1979). Sequential Forms of the Log Rank and Modified Wilcoxon Tests for Censored Data, Biometrika 66: 105–114; correction, Biometrika 68: 576.
  • Kiefer, J. and Weiss, L. (1957). Some Properties of Generalized Sequential Probability Ratio Tests, Annals of Mathematical Statistics 28: 57–74.
  • Lehmann, E. L. (1997). Testing Statistical Hypotheses, New York: Springer-Verlag.
  • Lilford, R. J., Braunholtz, Z., Edwards, S., and Stevens, A. (2001). Monitoring Clinical Trials—Interim Results Should Be Publicly Available, British Medical Journal 323: 441–442.
  • Migrino, R. Q. and Topol, E. J. (2003). A Matter of Life and Death? The Heart Protection Study and Protection of Clinical Trial Participants, Controlled Clinical Trials 24: 501–505.
  • O'Brien, P. C. and Fleming, T. R. (1979). A Multiple Testing Procedure for Clinical Trials, Biometrics 35: 549–556.
  • Pharmaceuticals Manufacturing Association Biostatistics and Medical Ad Hoc Committee on Interim Analysis. (1993). Interim Analysis in the Pharmaceutical Industry, Controlled Clinical Trials 14: 160–173.
  • Pocock, S. J. (1977). Group Sequential Methods in the Design and Analysis of Clinical Trials, Biometrika 64: 191–199.
  • Pocock, S. J. (1982). Interim Analysis for Randomized Clinical Trials: The Group Sequential Approach, Biometrics 38: 153–162.
  • Reidpath, D. (2001). Interim Data Are at Least as Important as Interim Analyses, British Medical Journal 323: 1425.
  • Rosner, G. L. and Tsiatis, A. A. (1989). The Impact That Group Sequential Tests Would Have Made on ECOG Clinical Trials, Statistics in Medicine 8: 505–516.
  • Senchaudhuri, P., Mehta, C. R., Deshmukh, A., Kulthe, S., Ghanekar, A., Kannappan, A., Khandelwal, L., and Sathe, A. (2005). Early Stopping in Clinical Trials (East 4.0), Cambridge: Cytel Statistical Software.
  • Shuster, J. J. (1992). Practical Handbook of Sample Sizes for Clinical Trials, Boca Raton: CRC Press.
  • Shuster, J. J., Link, M., Camitta, B., Pullen, J., and Behm, F. (2002). Minimax Two-Stage Designs with Applications to Tissue Banking Case-Control Studies, Statistics in Medicine 21: 2479–2493.
  • Shuster, J. J., Chang, M. N., and Tian, L. (2004). Design of Group Sequential Clinical Trials with Ordinal Categorical Data Based on the Mann-Whitney-Wilcoxon Tests, Sequential Analysis 23: 414–426.
  • Simon, R. (1991). A Decade of Progress in Statistical Methodology for Clinical Trials, Statistics in Medicine 10: 1789–1817.
  • Souhami, R. L. (1994). The Clinical Importance of Early Stopping of Randomized Trials in Cancer Treatments, Statistics in Medicine 13: 1293–1295.
  • Therneau, T. M., Wieand, H. S., and Chang, M. N. (1990). Optimal Designs for a Group Sequential Binomial Trial, Biometrics 46: 771–781.
  • Tsiatis, A. A. (1981). The Asymptotic Joint Distribution of the Efficient Scores Test for the Proportional Hazards Model Calculated Over Time, Biometrika 68: 311–315.
