THE UNIVERSITY OF BRITISH COLUMBIA
To the Editor: Mr Wayant and colleagues evaluated the effect of lowering the significance threshold from .05 to .005 on major randomized clinical trials (RCTs) published in 2017.(1) The authors reported that 70.7% of primary end points remained significant and suggested that lowering the threshold might address statistical issues such as P-hacking.
The probability that a positive finding in a study is a true positive—the positive predictive value (PPV)—depends on a priori knowledge (ie, prior probability) of the replicability of the study findings.(2) The rationale behind the proposed .005 threshold is that a P value of .05 does not correspond to reasonably high PPVs.(3) Although this might be true for early-phase trials, it does not apply to phase 3 RCTs. Major RCTs have a 69% prior probability of being successfully replicated.(4) Thus, a “positive” RCT would have a 97% chance of being a true positive when the P value is .05 (PPV, 97%).
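The 97% figure can be checked with the standard Bayes relation, PPV = power × prior / (power × prior + α × (1 − prior)). The sketch below is illustrative only: it assumes 80% power, which the letter does not state, and the function name is hypothetical.

```python
def positive_predictive_value(prior, alpha, power=0.80):
    """Probability that a statistically significant result is a true positive.

    prior: probability, before the trial, that the effect is real
    alpha: significance threshold (false-positive rate under the null)
    power: probability of detecting a real effect (0.80 is an assumption)
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Phase 3 RCT, using the 69% prior probability cited in the letter
ppv_05 = positive_predictive_value(prior=0.69, alpha=0.05)
ppv_005 = positive_predictive_value(prior=0.69, alpha=0.005)
print(f"PPV at .05:  {ppv_05:.1%}")   # 97.3%
print(f"PPV at .005: {ppv_005:.1%}")  # 99.7%
```

Under these assumptions, tightening the threshold from .05 to .005 raises the PPV of a phase 3 RCT only from about 97% to about 99.7%. The result shifts with the assumed power, so other inputs will give somewhat different values.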
The original proposal of switching to a .005 P value threshold correctly limited its recommendation to “claims of new discoveries.”(3) These types of studies (eg, basic science, preclinical studies, and early-phase trials) have much lower probabilities of successful replication (as low as 9%(3)) and thus lower PPVs (as low as 53%). An intervention that has made it to a large phase 3 RCT has already been through extensive testing, and most false-positive findings have been ruled out.
Lowering the significance threshold to .005 for large phase 3 trials would lead to larger and more expensive trials than current standards require, for little added benefit. A trial powered at the .005 threshold would require 70% more participants than one powered at the .05 level to achieve statistical significance for the same effect size.(3)
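The 70% figure follows from the usual normal-approximation sample size formula, in which the required number of participants is proportional to (z at 1 − α/2 plus z at the chosen power) squared. The sketch below assumes a two-sided test and 80% power, neither of which is stated in the letter; the function name is hypothetical.

```python
from statistics import NormalDist


def sample_size_ratio(alpha_new, alpha_old, power=0.80):
    """Ratio of required sample sizes at two thresholds, same effect size.

    For a two-sided z-test, n is proportional to (z_{1-alpha/2} + z_{power})^2,
    so the effect size and variance cancel out of the ratio.
    """
    z = NormalDist().inv_cdf  # standard normal quantile function
    n_new = (z(1 - alpha_new / 2) + z(power)) ** 2
    n_old = (z(1 - alpha_old / 2) + z(power)) ** 2
    return n_new / n_old


ratio = sample_size_ratio(alpha_new=0.005, alpha_old=0.05)
print(f"Extra participants needed: {ratio - 1:.0%}")  # 70%
```

Because the effect size cancels, the ratio depends only on the two thresholds and the assumed power; a higher assumed power gives a somewhat smaller relative increase.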
Many statisticians regard the practice of relying on a single P value threshold to declare studies positive or negative as arbitrary and flawed.(4) The American Statistical Association’s statement on P values explicitly rejects using such “bright-line rules” for policy decisions and scientific conclusions.(5) We share these concerns. A P value is a continuous measure of evidence and thus best interpreted in the context of the study. A significance threshold of any value encourages publication bias and P-hacking and should be avoided when possible. However, if a P value significance threshold has to be used for a phase 3 RCT, .05 is good enough. For now, there is no compelling reason to lower the P value threshold for late-phase RCTs.
References: