Special Section on Roles of Hypothesis Testing, p-Values, and Decision-Making in Biopharmaceutical Research

The p-Value: Guilty by Association

Pages 32-35 | Received 19 Oct 2020, Accepted 18 Nov 2020, Published online: 19 Feb 2021

It is noteworthy how many people will respond, when conversation turns to mathematics or statistics, that they personally have no aptitude and struggled with the subject since school. Such a reaction seems almost as common in drug development circles as it is in wider society. Still, while it is testament to their importance as anchors for discussing clinical trial data, p-values and “statistical significance” stand in stark contrast to other statistical concepts or methods in the extent to which they are used with confidence by non-statisticians.

Gibson’s (Citation2020) article nicely illustrates various uses of p-values, some being uses for which the p-value was developed, others not. Of particular importance is the distinction between a p-value (and other summary statistics) used as a measure of evidence against a null hypothesis from a prespecified primary analysis in a clinical trial that is serving as confirmation or replication of earlier experiments, and p-values (and other summary statistics) generated in exploratory studies or exploratory analyses, in particular if they are given priority because they are more extreme and hence more attractive. The generally accepted need for replication in science arises precisely because experiments and datasets can deliver results that are spurious. To my understanding, Gibson rightly differentiates between the p-value itself as a mathematical quantity and the erroneous association between the p-value and the so-called “replication crisis,” the latter being driven not by any given statistical metric being misleading or sub-optimal, but by “excessive optimism,” or the wish being the father of the thought: I want so much for the result to be true that I will ignore any idea that it is not.

1 Do Not Blame the p-Value for Misleading Data

Data inform, but they also mislead. A dataset can be biased, such that the data themselves and the summary measures or analyses thereof do not reflect the truth. Variability is importantly different in concept but can be similarly misleading. These, together with a human disposition, driven by scientific rigor or by financial or reputational pressures, to investigate data until something interesting is found, create something of a perfect storm. As we all understand, properly tortured, a dataset will eventually yield an interesting finding, if only from a secondary or post hoc analysis, a secondary or exploratory endpoint, or a subgroup. Furthermore, just because an analysis is prespecified as primary does not make it immune from a finding being a false positive or its magnitude being exaggerated by a play of chance in the data. These are reasons for science and drug development to demand confirmation and sometimes replication of findings, and they are also the reasons to have some skepticism as to whether it will be achieved.
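To make the “torture” point concrete, the following is a minimal simulation sketch of my own; the number of exploratory looks and the sample sizes are assumptions, not taken from any real trial. Even when no treatment effect exists anywhere, interrogating a dataset across many subgroups and secondary endpoints will frequently turn up at least one nominally significant comparison.

```python
# Minimal simulation sketch (numbers of looks and sample sizes are assumptions,
# not from any real trial): with no true treatment effect anywhere, interrogating
# a dataset across many subgroups and secondary endpoints still yields at least
# one nominally significant comparison in most trials.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_arm, n_looks, n_sims = 100, 20, 2000   # 20 exploratory comparisons per trial

trials_with_a_finding = 0
for _ in range(n_sims):
    p_values = []
    for _ in range(n_looks):
        treated = rng.normal(0.0, 1.0, n_per_arm)   # null is true: no effect
        control = rng.normal(0.0, 1.0, n_per_arm)
        p_values.append(stats.ttest_ind(treated, control).pvalue)
    trials_with_a_finding += min(p_values) < 0.05

# With 20 independent looks, about 1 - 0.95**20, roughly 64%, of null trials "find" something.
print(f"Null trials with at least one p<0.05: {trials_with_a_finding / n_sims:.0%}")
```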

It is interesting to reflect on the role of the p-value in this story. The p-value, and its significance in respect of a particular hypothesis test, has become the most important determinant as to whether something interesting has been found. It might also be the basis on which that finding is judged for its validity by others, for example, by a peer reviewer for publication in a journal or by a funding committee deciding whether to progress to the next experiment. Indeed, it seems that a significant p-value is sometimes taken to connote not only that a finding is interesting, but that the associated result, for example, the estimated size of a treatment effect, is true. If a significant p-value is automatically taken to connote a true result, a crisis in attempts at replication is inevitable, but the p-value itself is not to blame.
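A back-of-the-envelope calculation, with numbers that are purely my own assumptions, illustrates why a significant p-value cannot by itself connote a true result: if only a minority of the hypotheses examined are true and the analyses are modestly powered, a sizeable fraction of significant findings will be false.

```python
# Back-of-the-envelope arithmetic (all numbers are assumptions, not from the
# article): if only a minority of hypotheses examined are true and analyses are
# modestly powered, a significant p-value is far from a guarantee of a true result.
alpha = 0.05        # two-sided significance level
power = 0.50        # assumed power of a typical exploratory analysis
prior_true = 0.10   # assumed share of examined hypotheses that are actually true

p_true_given_significant = (prior_true * power) / (
    prior_true * power + (1 - prior_true) * alpha
)
print(f"P(effect is real | p < {alpha}) = {p_true_given_significant:.2f}")  # about 0.53
```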

If p-values are not appropriate to make reliable predictions about the probability of replication, then what? Well, if bias and variability mean that the data in the sample are obfuscating the truth, all unadjusted summary metrics of those data have the potential to mislead. Again, the problem does not lie with the p-value itself, but with the decisions made and actions taken on the basis of a significant hypothesis test. This is the methodological equivalent of saying that palm oils and plastics are the root cause of environmental problems, rather than the human beings who use vast quantities of both without due consideration.

2 Developing an Evidence Base to Support Regulatory Approval

Having violently agreed with Gibson in the statement of the problem, this commentary rehearses some observations and reflections on replication and confirmation in the development and regulation of medicines, and the different uses of the p-value in that process.

Drug development is driven primarily by generating and exploring scientific hypotheses that develop scientific and pharmacological understanding. A pharmacological rationale to test a particular formulation of an active substance against a particular pathology will be confirmed or revised through nonclinical models and clinical pharmacology studies. These studies inform about exposure and exposure–response relationships and help to identify a target posology and a target population. The understanding of pharmacology and the series of exploratory clinical studies determine the manner in which one or more confirmatory studies will be conducted, testing whether a drug effect can be said to exist and estimating its magnitude under well-understood (not necessarily restrictive) experimental conditions. Perhaps this nomenclature is important; regulatory guidance documents speak more of confirmation than of replication, though multiple confirmatory studies (replication of confirmatory evidence) add considerable strength to the evidence provided to support regulatory decision making. At the same time, empiricism is another important driver. Designs for confirmatory studies are commonly refined based on the most favorable findings from exploratory studies in terms of dose, endpoint definition, or population selection.

Results from exploratory studies, or from secondary analyses in confirmatory studies, are often labeled by regulators as “hypothesis generating,” requiring confirmation in an independent experiment. The phenomenon of false-positive or exaggerated results, exacerbated by the prioritization of more favorable results in empirically led development, gives a strong scientific rationale to keep the exploratory and confirmatory stages of drug development distinct, even if pressures of time and cost try to dictate otherwise. Frequentist methods are of course firmly established in drug development and regulation: p-values, hypothesis testing, and control of Type I error to 2.5% one-sided, or something more extreme if evidence of efficacy is to be provided on the basis of a single pivotal trial. The use of frequentist methods is not the reason, however, to keep the exploratory and confirmatory stages of development distinct. The fact that integration of existing information is conceptually and computationally easier under a Bayesian framework is not the point. The problems of false-positive or exaggerated results in datasets and unadjusted summary statistics remain whether you prefer a frequentist or a Bayesian approach: a point not always recognized by (mostly nonstatistical) proponents of the latter. Indeed, the perennial debate over which framework should be preferred by developers and regulators often misses this point. I can use a Bayesian approach without any major controversy if my analysis uses only the data from the confirmatory study, whereas if I am minded to incorporate existing data that I have selected because they are more favorable to my research hypothesis, then I had better account for that selection process regardless of the statistical framework.

The totality of evidence for favorable drug effects in a development program as currently conducted is not the fact that a null hypothesis is rejected once or twice at a particular level of statistical significance. There is, additionally, the whole series of independent nonclinical and clinical experiments that build understanding of the mechanism of action, of the pharmacodynamic response relevant to the pathophysiological process of the disease, evidence of dose-response and pharmacokinetic/pharmacodynamic relationships, and measures of efficacy in exploratory trials. It is on the strength of this supporting information that criteria for hypothesis tests in confirmatory trials are agreed. These criteria can be debated of course, whether they are too strict, or too liberal—why 5%? why not 1% or 10%? But if anyone wishes to replace the p-values generated in confirmatory studies with another metric or measure for the strength of evidence, in particular one that accumulates information across these different experiments, the same totality of evidence should eventually be reached, resulting in more extreme criteria for success in confirmatory studies.

If there is a replication crisis, should regulators ever trust a significant p-value from a single pivotal trial? The CHMP Points to Consider document on applications with one pivotal study (European Medicines Agency Citation2001) indicates that the extent of confirmatory data needed will depend upon what is established for the product in earlier phases and what is known about related products, citing a lack of pharmacological rationale as a reason to plan for more than one confirmatory study, and essentially establishing a strong pharmacological understanding as one pillar to support a regulatory opinion. That document also specifies criteria to be met for the results of a single pivotal study to be accepted, including that statistical evidence considerably stronger than p<0.05 is usually required. This is not always achievable. More and more products are targeted toward rare diseases and subsets of (sometimes already small) target populations based on mechanism of action. Replication in such cases can anyway be ethically unacceptable or operationally infeasible but, importantly, can also be less critical scientifically where the products in question are targeted based on the strength of pharmacological understanding and rationale.
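One way to give a rough quantitative sense of “considerably stronger than p<0.05” (this reading is my own, not a stated regulatory rule) is to note that two independent confirmatory trials, each significant at one-sided 2.5%, hold the chance of a false-positive efficacy claim to 0.025 squared; a single pivotal trial might then be asked to deliver comparable strength of evidence.

```python
# A rough illustration (my reading, not a stated regulatory rule): two independent
# confirmatory trials, each significant at one-sided 2.5%, keep the probability of
# a false-positive efficacy claim to 0.025**2. A single pivotal trial asked for
# "considerably stronger" evidence might be held to a comparable level.
from scipy.stats import norm

alpha_single = 0.025                     # one-sided level for one confirmatory trial
alpha_two_trials = alpha_single ** 2     # 0.000625, about 1 in 1600
z_equivalent = norm.isf(alpha_two_trials)
print(f"Combined one-sided alpha for two trials: {alpha_two_trials:.6f}")
print(f"Equivalent single-trial |z| threshold:   {z_equivalent:.2f}")   # about 3.23
```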

3 p-Values and Hypothesis Tests in Medicines Regulation

The highest profile p-value is for the prespecified primary analysis of the prespecified primary endpoint. This p-value dictates the formal success or failure of a trial and hence can be a gatekeeper for an application for Marketing Authorisation. While there can be exceptions, with p<5% two-sided you are in the game, ready to debate clinical relevance and benefit-risk; with p>5% two-sided, getting anything from the game is an uphill battle. With p<5% two-sided the dataset is worth thousands of person-hours of interrogation and interpretation; with p>5% it is often worth only a post-mortem. Whether it makes sense for these decisions to be so closely related to a dichotomous interpretation of a hypothesis test, and the choice of the 5% significance level, can be debated of course. Whereas Gibson (Citation2020) explains that a p-value of 0.02 is not twice the strength of evidence of a p-value of 0.04, everyone can recognize the point that p-values of 0.049 and 0.051 confer a similar strength of evidence. Again, the debate is not with the p-value itself, but with a scientific and regulatory policy that has developed based on that metric.
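To make the point concrete, here is a small illustration of my own, using z-statistics rather than Gibson’s log-scale presentation: converting two-sided p-values to the corresponding |z| shows how little separates 0.049 from 0.051 as evidence, even though the 5% rule treats them as success versus failure.

```python
# My own illustration (using z-statistics rather than Gibson's log-scale
# presentation): converting two-sided p-values to |z| shows how little separates
# 0.049 from 0.051, and that 0.02 versus 0.04 is nowhere near "double the evidence."
from scipy.stats import norm

for p in (0.04, 0.02, 0.051, 0.049):
    z = norm.isf(p / 2)   # two-sided p-value -> |z|
    print(f"p = {p:5.3f}  ->  |z| = {z:.3f}")
# p = 0.040 -> |z| = 2.05;  p = 0.020 -> |z| = 2.33
# p = 0.051 -> |z| = 1.95;  p = 0.049 -> |z| = 1.97
```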

I think there has been considerable value in there being rules for this part of the game between the drug developer and the regulator, so that consistency can be applied in regulatory decision making and so that the game is, at least in that regard, predictable to those wishing to play. With rules in place, specifically a requirement for statistically significant tests of hypotheses related to primary trial objectives, it is easy to speculate that potentially useful medicines might be missed. On the other hand, without such rules it is easy to think that the type of evidence highlighted by Gibson (Citation2020) as leading to “excessive optimism based on unknowingly (and sometimes knowingly) overstated evidence” would result in many premature regulatory applications.

This said, it is inevitable that rules will be bent once they have been set. Of course, while a statistically significant hypothesis test related to a primary trial objective is close to a necessary condition for a trial to serve as confirmatory evidence in an application for Marketing Authorisation, it is, appropriately, not a sufficient condition for regulatory approval. As we all understand, a demonstration of statistical significance does not ensure that a clinically meaningful beneficial effect offsets the adverse effects of the treatment. But the fact that a statistically significant result from a primary hypothesis test is so important to be in the game has arguably given rise to some of the choices now made in clinical trial design and analysis: attempts to load the dice! These choices include the precise definition of the primary efficacy variable, selected not only because it reflects a benefit to patients, but also because it relates to the pharmacology of the intervention; timepoints selected on the basis of when the maximal therapeutic effect might be expected to occur; the identification of trial participants more likely to adhere to treatment without need for treatment holidays or additional treatments being added; the choices made in respect of data handling, ignoring data collected after treatment discontinuation or initiation of another treatment, replacing observable data with methods reliant on the missing-at-random assumption, essentially predicting what would have happened under adherence to treatment. The future impact of the estimand framework outlined in ICH E9(R1) (European Medicines Agency Citation2001) will be interesting in this regard. The estimand is not an end in itself; not simply another way to document what is already done. Instead, it is a means to an end, focusing the selection of patients, treatments and variables and methods of data handling and analysis on questions that are agreed to be of clinical interest.

Ultimately, a developer might be so afraid of a failure to show p<5%, perhaps despite their best intentions, that they try to argue in favor of a single-arm trial using within-patient comparisons to baseline, rather than a randomized trial. Should there be flexibility in the significance level used to conduct hypothesis tests in some settings to encourage or facilitate the conduct of randomized controlled trials? Are higher significance levels justified in settings of high unmet medical need; where the objective relates to survival; or where a sample size sufficient to power a study to detect a minimally clinically relevant effect size on a patient relevant outcome is not feasible? The difficulty of course would be to develop a system that could be applied consistently and would not be open to abuse.
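As an illustration of the feasibility problem (the effect sizes and power below are my assumptions), a standard two-arm sample size formula shows how quickly the required number of patients grows as the minimally clinically relevant effect shrinks, and what relaxing the significance level would buy in such settings.

```python
# Illustrative only (effect sizes are assumptions, not from the article): per-arm
# sample size for a two-arm comparison of means with standardized effect delta,
# n = 2 * (z_alpha + z_beta)^2 / delta^2. It shows why powering for a minimally
# clinically relevant effect can be infeasible in a small population, and what
# relaxing the significance level would buy.
from scipy.stats import norm

def n_per_arm(delta, alpha_one_sided=0.025, power=0.9):
    z_a = norm.isf(alpha_one_sided)
    z_b = norm.isf(1 - power)
    return 2 * (z_a + z_b) ** 2 / delta ** 2

for delta in (0.5, 0.3, 0.2):
    n_std = n_per_arm(delta)
    n_relaxed = n_per_arm(delta, alpha_one_sided=0.05)
    print(f"delta={delta}: ~{n_std:.0f}/arm at 2.5% one-sided, ~{n_relaxed:.0f}/arm at 5% one-sided")
```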

4 Inappropriate Uses of p-Values and Hypothesis Tests

There remain willful misuses of p-values and hypothesis tests, usually based on a p-value >5% (a nonsignificant result) from a statistical test with low power to detect effect sizes of interest. It remains, for example, common to argue that differences in estimated effects between subgroups should not concern a regulatory reviewer because the associated interaction test “is nonsignificant.” Nonsignificant test results are also used as a basis to assert that there is no important increase in adverse event rates. Again, such tests are open to abuse, lumping events (preferred terms) together to add noise or splitting events to minimize information and statistical power. The use of significance tests for model selection, and to conclude that model assumptions (e.g., normality) are valid, can also be discussed.
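A short sketch with assumed numbers makes plain why “the interaction test is nonsignificant” is weak reassurance: a trial sized for its overall effect typically has far lower power to detect even a substantial difference in effects between two subgroups.

```python
# Sketch with assumed numbers (not from the article): a trial sized to detect an
# overall standardized effect of 0.3 with ~90% power has much lower power for a
# subgroup-by-treatment interaction test, so a nonsignificant interaction is
# weak evidence of consistency of effects.
import numpy as np
from scipy.stats import norm

n_per_arm = 250                          # assumed patients per arm overall
n_sub = n_per_arm / 2                    # per arm within each of two equal subgroups
effect_a, effect_b = 0.45, 0.15          # assumed true effects in the two subgroups

se_within = np.sqrt(2 / n_sub)           # SE of the effect estimate in one subgroup
se_interaction = np.sqrt(2) * se_within  # SE of the difference between subgroups
z_alpha = norm.isf(0.025)                # two-sided 5% test

power_overall = norm.sf(z_alpha - 0.30 / np.sqrt(2 / n_per_arm))
power_interaction = norm.sf(z_alpha - (effect_a - effect_b) / se_interaction)
print(f"Power for the overall effect:   {power_overall:.0%}")      # about 92%
print(f"Power for the interaction test: {power_interaction:.0%}")  # about 39%
```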

The question of findings in subgroups is perhaps particularly noteworthy as the role of replication is somewhat reversed and the low probability of replication if a finding is exaggerated or untrue can be important for decision making. Replication, defined as another data source in which the same question or hypothesis can be assessed, has a key role in the assessment of subgroups (European Medicines Agency Citation2019). A second source of evidence telling the same story can lend strong support to an inconsistency of treatment effects across subgroups, even without significant p-values. If, however, a finding fails to be replicated—quite common of course where the result is a play of chance—the finding can be dismissed as such, even if the interaction test was statistically significant.

Another interesting example is the reliance on statistical significance, indeed strong control of Type I error, across a family of secondary endpoints to determine information that is deemed appropriate for product labeling. Historically at least, and per guidance, FDA has been somewhat stricter with regard to error control across families of secondary endpoints than EMA (European Medicines Agency Citation2017; U.S. Department of Health and Human Services Food and Drug Administration Citation2017). The motivation to avoid inappropriate “claims,” and hence to control what advertising and communication with healthcare providers is permitted, is fully understood. However, having established that there is sufficient strength of evidence on variables that demonstrate efficacy and support approval, it is not obvious that the link between significant statistical tests and information thought reliable to communicate to prescribers needs to be so strong. Indeed, if a variable measures an effect of treatment that is relevant to the prescriber’s decision making, it seems desirable to communicate appropriate information along with a description of uncertainties including, if appropriate, confidence intervals and (perhaps nonsignificant) p-values from statistical tests.
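For readers less familiar with what strong control across a family of secondary endpoints looks like in practice, the following is a minimal sketch of one common approach, a Holm adjustment; it is an illustration of the general technique, not a procedure mandated by either agency.

```python
# Minimal sketch of one common approach (a Holm adjustment; not a procedure
# mandated by FDA or EMA): strong family-wise error control across secondary
# endpoints once the primary hypothesis test has succeeded.
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted, running_max = [0.0] * m, 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted

secondary_p = [0.012, 0.030, 0.004, 0.20]   # hypothetical secondary endpoints
print(holm_adjust(secondary_p))             # [0.036, 0.06, 0.016, 0.2]
```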

5 Discussion

The article by Eric Gibson is informative and timely. A key conclusion of Gibson’s (Citation2020) article is the need for education. This is easily agreed.

The world has accepted requirements to label certain results as “exploratory,” “secondary,” or “post hoc,” but perhaps those terms pay only lip service and have little impact on authors and reviewers who might consider using naïve, unadjusted results for interpretation, for decision making, and as a basis to plan subsequent experiments. Gibson’s specific proposals to interpret p-values as continuous on a log scale and to describe the scope of nonnull effects that are excluded can be illustrative, as might a greater use of the other summary measures described in the article. The use of methods to discount effect sizes when planning future studies makes good sense of course and is, to my understanding, already widely applied by pharmaceutical companies in drug development. Of Gibson’s four-part guide, the application of appropriate adjustments for multiplicity and selective inference might have the greatest positive impact, however, certainly when considering interpretation of results from exploratory studies. A quantitative adjustment to the results that are available for interpretation would likely be more impactful than the qualitative labels given above. How should this be done? Perhaps industry best practice can be developed, or internal company standard operating procedures? Either way, it seems particularly important to document scenarios and agree methods in advance, before a problematic dataset becomes available. The community of statisticians is already at risk of being seen as laggards, enforcing the rules and standards of good practice in clinical trial design, analysis, and interpretation. Being the team member with responsibility for appropriately shrinking interesting effect sizes or measures of strength of evidence puts the statistician again in a difficult position. Better, then, to be the good cops, educating and seeking agreement prospectively, rather than the bad cops needing to influence a conversation once attractive data are available and the excessive optimism has set in.
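One possible form such a quantitative adjustment could take, offered purely as my own illustration and not as an established industry standard or Gibson’s proposal, is to shrink an attractive exploratory effect estimate toward zero with a sceptical prior before using it to size the confirmatory trial; all numbers below are assumptions.

```python
# One possible quantitative adjustment (my illustration, not an established
# industry standard or Gibson's proposal): shrink an attractive exploratory
# effect estimate toward zero with a sceptical normal prior before using it to
# size the confirmatory trial.
from scipy.stats import norm

observed_effect = 0.40   # assumed standardized effect from an exploratory study
se_observed = 0.15       # assumed standard error of that estimate
prior_sd = 0.20          # assumed sceptical prior, centred at zero

weight_on_data = prior_sd**2 / (prior_sd**2 + se_observed**2)
shrunk_effect = weight_on_data * observed_effect
print(f"Effect used for planning after shrinkage: {shrunk_effect:.2f}")  # 0.26, not 0.40

# Consequence for the confirmatory sample size (90% power, 2.5% one-sided):
z_total = norm.isf(0.025) + norm.isf(0.10)
for delta in (observed_effect, shrunk_effect):
    print(f"delta = {delta:.2f}: about {2 * z_total**2 / delta**2:.0f} patients per arm")
```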

As for the wider debate on the maintenance or abandonment of p-values, I will side with those arguing for the former: used for an appropriate purpose, correctly interpreted, and not to the exclusion of complementary or alternative approaches where those are also instructive.
