Armed with our Power surface, let’s revisit John Ioannidis’s claim according to which ‘Most Published Research Findings Are False’.
Ioannidis’s target is the immeasurable confusion generated by the widespread mistake of interpreting Fisher’s 5% statistical significance as implying a high probability that the hypothesis of interest is true. FPR<5%, hence PP>95%. As we have seen, this is far from being the general case: TPR=PP if and only if BO=FPR/FNR, which under prior indifference requires error symmetry.
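The relation between FPR, TPR, the prior and PP can be checked numerically. The following is a minimal sketch, not part of the original argument: the function name and the numerical values are illustrative assumptions, and BR is taken to be the prior probability, so that BO=BR/(1-BR).

```python
def posterior_probability(br: float, tpr: float, fpr: float) -> float:
    """Bayes' theorem: PP = TPR*BR / (TPR*BR + FPR*(1 - BR))."""
    return tpr * br / (tpr * br + fpr * (1.0 - br))


# Illustrative values (assumptions): FPR = 5%, TPR = 80%, hence FNR = 20%.
fpr, tpr = 0.05, 0.80
fnr = 1.0 - tpr

# PP equals TPR exactly when BO = FPR/FNR, here 0.25, i.e. BR = 0.2.
bo = fpr / fnr
br = bo / (1.0 + bo)
print(f"BO = FPR/FNR: PP = {posterior_probability(br, tpr, fpr):.3f}")  # 0.800

# Under prior indifference (BR = 50%), PP = TPR requires FPR = FNR.
print(f"Symmetric errors: PP = {posterior_probability(0.5, 0.95, 0.05):.3f}")  # 0.950

# With a low prior, FPR = 5% leaves PP far below 95%.
print(f"BR = 5%: PP = {posterior_probability(0.05, tpr, fpr):.3f}")  # 0.457
```

With a 5% prior, a 'significant' result at FPR=5% and 80% power leaves the hypothesis more likely false than true.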
Fisher’s 5% significance is neither a sufficient nor a necessary condition for accepting the hypothesis of interest. Besides FPR, acceptance and rejection depend on BR, PP and TPR. Given FPR=5%, all combinations of the three variables lying on or above the curved surface indicate acceptance. But, contrary to Fisher’s criterion, combinations below the surface indicate rejection. The same is true for values of FPR below 5%, which are even more ‘significant’ according to Fisher’s criterion. These widen the curved surface and shrink the roof, thus enlarging the scope for acceptance, but may still indicate rejection for low priors and high standards of proof, if TPR power is not or cannot be high enough. On the other hand, FPR values above 5%, which according to Fisher’s criterion are ‘not significant’ and therefore imply unqualified rejection, reduce the curved surface and expand the roof, thus enlarging the scope for rejection, but may still indicate acceptance for higher priors and lower standards of proof, provided TPR power is high enough. Here are pictures for FPR=2.5% and 10%:
So let’s go back to where we started. We want to know whether a certain claim is true or false. But now, rather than seeing it from the perspective of a statistician who wants to test the claim, let’s see it from the perspective of a layman who wants to know if the claim has been tested and whether the evidence has converged towards a consensus, one way or the other.
For example: ‘Is it possible to tell whether milk has been poured before or after hot water by just tasting a cup of tea?’ (bear with me please). We google the question and let’s imagine we get ten papers delving into this vital issue. The first and earliest is by none other than the illustrious Ronald Fisher, who performed it on the algologist Muriel Bristol and, on finding that she made no mistakes with 8 cups – an event that has only 1/70 probability of being the product of chance, i.e. a p-value of 1.4%, much lower than required by his significance criterion – concluded, against his initial scepticism, that ‘Yes, it is possible’. That’s it? Well, no. The second paper describes an identical test performed on the very same Ms Bristol three months later, where she made 1 mistake – an event that has a 17/70 probability of being a chance event, i.e. a p-value of 24.3%, much larger than the 5% limit allowed by Fisher’s criterion. Hence the author rejected Professor Fisher’s earlier claim about the lady’s tea-tasting ability. On to the third paper, where Ms Bristol was given 12 cups and made 1 mistake, an event with a 37/924=4% probability of being random, once again below Fisher’s significance criterion. And so on with the other papers, each one with its own set up, its own numbers and its own conclusions.
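The three p-values quoted above can be reproduced with a short combinatorial count. This is a sketch under stated assumptions: the function name is hypothetical, half of the cups are taken to be milk-first, the lady is assumed to know this and to pick exactly that many, and a 'mistake' is one wrong cup among those she picks.

```python
from math import comb


def tea_p_value(cups: int, mistakes: int) -> float:
    """Chance probability of at most `mistakes` wrong picks.

    Half of the cups are milk-first and the taster picks exactly that many.
    Under pure guessing, every selection of cups // 2 cups is equally likely.
    """
    k = cups // 2
    total = comb(cups, k)
    # m wrong picks: choose m of the k milk-first cups to miss and
    # m of the other cups to pick wrongly in their place.
    favourable = sum(comb(k, m) * comb(cups - k, m) for m in range(mistakes + 1))
    return favourable / total


print(f"8 cups, no mistakes: {tea_p_value(8, 0):.3f}")   # 1/70
print(f"8 cups, 1 mistake:   {tea_p_value(8, 1):.3f}")   # 17/70
print(f"12 cups, 1 mistake:  {tea_p_value(12, 1):.3f}")  # 37/924
```

The one-mistake figures are cumulative: they count the outcomes at least as good as the one observed, which is why the second paper's numerator is 17 (1 perfect selection plus 16 with one wrong pick) rather than 16.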
It is tempting at this point for the layman to throw up his arms in despair and execrate the so-called experts for being unable to give a uniform answer to such a simple question. But he would be entirely wrong. The evidential tug of war between confirmative and disconfirmative evidence is the very essence of science. It is up to us to update our prior beliefs through multiplicative accumulation of evidence and to accept or reject a claim according to our standard of proof.
If anything, the problem is the opposite: not too much disagreement but – and this is Ioannidis’s main point – too little. Evidence accumulation presupposes that we are able to collect the full spectrum, or at least a rich, unbiased sample of all the available evidence. But we hardly ever do. The evidence we see is what reaches publication, and publication is naturally skewed towards ‘significant’ findings. A researcher who is trying to prove a point will only seek publication if he thinks he has gathered enough evidence to support it. Who wants to publish a paper about a theory only to announce that he has got inadequate evidence for it? And even if he tried, what academic journal would publish it?
As a result, available evidence tends to be biased towards acceptance. And since acceptance is still widely based on Fisher’s criterion, most published papers present FPR<5%, while those with FPR>5% remain unpublished and unavailable. To add insult to injury, in order to reach publication some studies get squeezed into significance through more or less malevolent data manipulation. It is what Ioannidis calls bias: ‘the combination of various design, data, analysis, and presentation factors that tend to produce research findings when they should not be produced’ (p. 0697).
This dramatically alters the evidential tug of war. It is as if, when looking into the milk-tea question, we found only Fisher’s paper and others accepting the lady’s ability – including some conveniently glossing over a mistake or two – and none on the side of rejection. We would then be inclined to conclude that the experts agree and would be tempted to go along with them – perhaps disseminating and reinforcing the bias through our own devices.
How big a problem is this? Ioannidis clearly thinks it is huge – hence the dramatic title of his paper, enough to despair not just about some experts but about the entire academic community and its scientific enterprise. Is it this bad? Are we really swimming in a sea of false claims?
Let’s take a better look. First, we need to specify what we mean by true and false. As we know, it depends on the standard of proof, which in turn depends on utility preferences. What is the required standard of proof for accepting a research finding as true? Going by the wrongful interpretation of Fisher’s 5% significance criterion, it is PP>95%. But this is not only a mistake: it is the premise behind an insidious misrepresentation of the very ethos of scientific research. Obviously, truth and certainty are the ultimate goals of any claim. But the value of a research finding is not in how close it is to the truth, but in how much closer it gets us to the truth. In our framework, it is not in PP but in LR.
The goal of scientific research is finding and presenting evidence that confirms or disconfirms a specific hypothesis. How much more (or less) likely is the evidence if the hypothesis is true than if it is false? The value of evidence is in its distance from the unconfirmative middle point LR=1. A study is informative, hence worth publication, if the evidence it presents has a Likelihood Ratio significantly different from 1, and is therefore a valuable factor in the multiplicative accumulation of knowledge. But a high (or low) LR is not the same as a high (or low) PP. They only coincide under prior indifference, where, more precisely, PO=LR, i.e. PP=LR/(1+LR). So, for example, if LR=19 – the evidence is 19 times more likely if the hypothesis is true than if it is false – then PP=19/20=95%. But, as we know very well, prior indifference is not a given: it is a starting assumption which, depending on the circumstances, may or may not be valid. BO – Ioannidis calls it R, ‘the ratio of “true relationships” to “no relationships”’ (p. 0696) – gives the pre-study odds of the investigated hypothesis being true. It can be high, if the hypothesis has been tested before and is already regarded as likely true, or low, if it is a novel hypothesis that has never been tested and, if true, would be an unexpected discovery. In the first case, LR>1 is a further confirmation that the hypothesis should be accepted as true – a useful but hardly noteworthy exercise that just reinforces what is already known. On the other hand, LR<1 is much more interesting, as it runs against the established consensus. LR could be so low as to convert a high BO into a low PO, thus rejecting a previously accepted hypothesis. But not necessarily: while lower than BO, PO could remain high, thus keeping the hypothesis true, while casting some doubts on it and prodding further investigation. In the second case, LR<1 is a further confirmation that the hypothesis should be rejected as false.
On the other hand, LR>1 increases the probability of an unlikely hypothesis. It could be so high as to convert a low BO into a high PO, thus accepting what was previously an unfounded conjecture. But not necessarily: while higher than BO, PO could remain low, thus keeping the hypothesis false, but at the same time stimulating more research.
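The cases just described all follow from the same update, PO=BO×LR, with PP=PO/(1+PO). Here is a minimal sketch: the function names and the BO and LR values are illustrative assumptions, chosen only to show each case.

```python
def posterior_odds(bo: float, lr: float) -> float:
    """Multiplicative update: posterior odds = prior odds x likelihood ratio."""
    return bo * lr


def odds_to_probability(odds: float) -> float:
    """Convert odds into a probability: P = O / (1 + O)."""
    return odds / (1.0 + odds)


cases = [
    ("prior indifference, LR=19", 1.0, 19.0),
    ("low BO, confirmative LR", 0.01, 19.0),
    ("high BO, mildly disconfirmative LR", 9.0, 0.5),
    ("high BO, strongly disconfirmative LR", 9.0, 0.05),
]

for label, bo, lr in cases:
    prior = odds_to_probability(bo)
    post = odds_to_probability(posterior_odds(bo, lr))
    print(f"{label}: prior PP = {prior:.3f}, posterior PP = {post:.3f}")
```

The same LR=19 that yields PP=95% under prior indifference lifts a hypothesis with BO=0.01 to a PP of only about 16%, while LR=0.5 against BO=9 lowers PP from 90% to about 82% without overturning the hypothesis.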
Such distinctions get lost in Ioannidis’s sweeping claim. True, SUTCing priors and neglecting a low BO can lead to mistakenly accepting hypotheses on the basis of evidence that, while confirmative, leaves PO well below any acceptance level. The mistake is exacerbated by Fisher’s Bias – confusing significance (a low FPR) with confirmation (a high PP) – and by Ioannidis’s bias – squeezing FPR below 5% through data alteration. FPR<5% does not mean PP>95% or even PP>50%. As shown in our Power surface, for any standard of proof, the lower BO is, the higher the TPR required at any level of FPR. Starting from a low BO, accepting the hypothesis requires very powerful evidence. Without it, acceptance is a false claim. Moreover, published acceptances – rightful or wrongful – are not adequately counterbalanced by rejections, which remain largely unpublished. This however occurs for an entirely legitimate reason: there is little interest in rejecting an already unlikely hypothesis. Interesting research is what runs counter to prior consensus. Starting from a low BO, any confirmative evidence is interesting, even when it is not powerful enough to turn a low BO into a high PO. Making an unlikely hypothesis a bit less unlikely is interesting enough, and is worth publication. But – here Ioannidis is right – it should not be confused with acceptance. Likewise, there is little interest in confirmative evidence when BO is already high. What is interesting in this case is disconfirmative evidence, again even when it is not powerful enough to reject the hypothesis by turning a high BO into a low PO. Making a likely hypothesis a bit less likely is interesting enough. But it should not be confused with rejection.
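The dependence of the required power on BO can be made explicit. Accepting at a standard of proof P means PO ≥ P/(1-P); since PO=BO×TPR/FPR, this gives TPR ≥ FPR×P/((1-P)×BO). The sketch below applies that inequality; the function name and the sample values are illustrative assumptions.

```python
from typing import Optional


def required_tpr(bo: float, fpr: float, standard: float) -> Optional[float]:
    """Minimum TPR for which PP reaches the standard of proof.

    Derived from BO * TPR / FPR >= standard / (1 - standard).
    Returns None when the minimum exceeds 1, i.e. when no test with this
    FPR can be powerful enough to justify acceptance.
    """
    needed = fpr * (standard / (1.0 - standard)) / bo
    return needed if needed <= 1.0 else None


# FPR = 5% throughout; BO and the standard of proof vary.
for bo, standard in [(1.0, 0.50), (1.0, 0.95), (0.1, 0.50), (0.1, 0.95)]:
    tpr = required_tpr(bo, fpr=0.05, standard=standard)
    shown = "unattainable" if tpr is None else f"{tpr:.2f}"
    print(f"BO = {bo}, standard = {standard:.0%}: required TPR = {shown}")
```

With BO=0.1 and a 95% standard of proof, the required TPR would be 9.5, so FPR=5% cannot support acceptance however powerful the study is; the evidence can still be confirmative and worth publishing without being sufficient.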