The ‘lady tasting tea’ is one of the most famous experiments in the history of statistics. Ronald Fisher told the story in the second chapter of The Design of Experiments, published in 1935 and regarded ever since as the Bible of experimental design. Apparently, the lady was right: she could easily distinguish the two kinds of tea. We don’t know the details of the impromptu experiment, but on subsequent reflection Fisher agreed that ‘an event which would occur by chance only once in 70 trials is decidedly “significant”’ (p. 13). At the same time, however, he found it ‘obvious’ that ‘3 successes to 1 failure, although showing a bias, or deviation, in the right direction, could not be judged as statistically significant evidence of a real sensory discrimination’ (pp. 14-15). His reason:

It is usual and convenient for experimenters to take 5 per cent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the great part of the fluctuations which chance causes have introduced into their experimental results. (p. 13).

Statistically significant at the 5% level: that’s where it all started – the most used, misused and abused criterion in the history of statistics.

Where did it come from? Let’s follow Fisher’s train of thought. Remember what we said about the Confirmation Bias: we cannot look at TPR without looking at its associated FPR. Whatever the result of an experiment, there is always a probability, however small, that it is just a product of chance. How small – asked Fisher – should that probability be for us to be comfortable that the result is not the product of chance? How small – in our framework – should FPR be? 5% – said Fisher. If FPR is lower than 5% – as it is with a perfect cup choice – we can safely conclude that the result is significant, and not a chance event. If FPR is above 5% – as it is with 3 successes and 1 failure – we cannot. That’s it – no further discussion. What about TPR? Shouldn’t we look at FPR in relation to TPR? Not according to Fisher: FPR – the probability of observing the evidence if the hypothesis is false – is all that matters. So much so that, with a bewildering flip, he pretended that the hypothesis under investigation was not that the lady could taste the difference, but its opposite: that she couldn’t. He called it the null hypothesis. After the flip, his criterion is: if the probability of the evidence, given that the null hypothesis is true – he called it the p-value – is less than 5%, the null hypothesis is ‘disproved’ (p. 16). If it is above 5%, it is not disproved.

Why such an awkward twist? Because – said Fisher – only the probability of the evidence under the hypothesis of no ability can be calculated exactly, according to the laws of chance. Under the null hypothesis, the probability of a perfect choice is 1/70, the probability of 3 successes and 1 failure is 16/70, and so on. The probability of the evidence under the hypothesis of some ability, on the other hand, cannot be calculated exactly, unless the ability level is specified. For instance, under perfect ability, the probability of a perfect choice is 100% and the probability of any number of mistakes is 0%. But how can we quantify the probability distribution under the hypothesis of some unspecified degree of ability – which, despite Fisher’s contortions, is the hypothesis of interest and the actual subject of the enquiry? We can’t. And if we can’t quantify it – seems to be the conclusion – we might as well SUTC it: Sweep it Under The Carpet.
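
These chance probabilities are simple combinatorics, and can be checked in a few lines of Python (a sketch; the helper name is ours):

```python
from fractions import Fraction
from math import comb

# Under the null hypothesis of no ability, the lady picks 4 of the 8 cups
# at random as 'milk first': C(8,4) = 70 equally likely selections.
TOTAL = comb(8, 4)

def p_mistakes(m):
    # Exactly m mistakes: m tea-first cups picked and m milk-first cups missed.
    return Fraction(comb(4, m) ** 2, TOTAL)

print(p_mistakes(0))  # 1/70: a perfect choice
print(p_mistakes(1))  # 8/35, i.e. 16/70: 3 successes and 1 failure
```

The five probabilities (for 0 to 4 mistakes) sum to one, as they must.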

How remarkable. This is the Confirmation Bias’s mirror image. The Confirmation Bias fixates on TPR and forgets about FPR. Fisher’s Bias does something similar: it focuses on FPR – because it can be calculated – and disregards TPR – because it can’t. The resulting mistake is the same: they both ignore that what matters is not how high TPR is, or how low FPR is, but how they relate to each other.

To see this, let’s assume for a moment that the hypothesis under investigation is ‘perfect ability’ versus ‘no ability’ – no middle ground. In this case, as we just said, TPR=1 for a perfect choice and 0 otherwise. Hence, under prior indifference, we have PO=LR=1/FPR or PO=0. Fisher agrees:

If it were asserted that the subject would never be wrong in her judgements we should again have an exact hypothesis, and it is easy to see that this hypothesis could be disproved by a single failure, but could never be proved by any finite amount of experimentation. (p. 16).

As we have seen, with a perfect choice over 8 cups we have FPR=1/70 and therefore PO=70 i.e. PP=98.6% (remember PO=PP/(1-PP)). True, it is not conclusive evidence that the lady is infallible – as the number of cups increases, FPR tends to zero but never reaches it – but to all intents and purposes we are virtually certain that she is. Fisher might strain her patience and feed her a few more cups, and twist his tongue saying that what he did was disprove that the lady was just lucky. In fact, all he required to do so was FPR<5%, i.e. PO>20 and PP>95.2% – a verdict beyond reasonable doubt. On the other hand, even a single mistake provides conclusive evidence to disprove the hypothesis that the lady is infallible – in the same way that a black swan disproves the hypothesis that all swans are white: TPR=0, hence PO=0 and PP=0%, irrespective of FPR.
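
The arithmetic from FPR to posterior odds and posterior probability can be sketched as follows, under prior indifference and with TPR=1, as in the ‘perfect ability’ hypothesis (the function name is ours):

```python
from fractions import Fraction

def posterior_prob(fpr, tpr=Fraction(1)):
    # Under prior indifference, posterior odds PO = LR = TPR/FPR,
    # and PP = PO/(1+PO).
    po = Fraction(tpr) / Fraction(fpr)
    return po / (1 + po)

print(posterior_prob(Fraction(1, 70)))  # 70/71, i.e. about 98.6%
print(posterior_prob(Fraction(1, 20)))  # 20/21, i.e. about 95.2%: Fisher's 5% bar
```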

Let’s now ask: What happens if we replace ‘perfect ability’ with ‘some ability’? The alternative hypothesis is still ‘no ability’, so FPR stays the same. The difference is that we cannot exactly quantify TPR. But we don’t need to. All we need to do is define the level of PP required to accept the hypothesis. This gives us a required PO – let’s call it rPO – which, given FPR, implies a required level of TPR: rTPR=rPO∙FPR. Let’s say for example that the required PP is 95%. Then rPO=19 and rTPR=19∙FPR. Hence, in the case of a perfect choice, rTPR=19∙(1/70). At this point all we need to ask is: are we comfortable that the probability of a perfect choice, given that the lady has some ability, is at least 19/70? Remember that the same probability under no ability is 1/70 and under perfect ability is 70/70. If the answer is yes – as it should reasonably be – we accept the hypothesis. There is no need to know the exact value of TPR, as long as we are comfortable that it exceeds rTPR. On the other hand, if the lady makes one mistake we have rTPR=19∙(17/70): the required probability of one or no mistake, given some ability, exceeds 100%. Hence we reject the hypothesis.
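
The required-TPR reasoning can be sketched in code, with rPO=19 for the 95% standard of proof (names are ours):

```python
from fractions import Fraction

rPO = Fraction(19)  # required posterior odds for PP = 95%: 0.95/0.05

def required_tpr(fpr, rpo=rPO):
    # The minimum TPR needed to reach the standard of proof: rTPR = rPO * FPR.
    return rpo * fpr

print(required_tpr(Fraction(1, 70)))   # 19/70: attainable with some ability
print(required_tpr(Fraction(17, 70)))  # 323/70 > 1: impossible, so reject
```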

This coincides with Fisher’s conclusion, as 1/70 is below 5% and 17/70 is above. But what happens if we lower rPO? After all, 95% is a very high standard of proof: do we really need to be sure beyond reasonable doubt that the lady has some tea tasting ability? What if we are happy with 75%, i.e. rPO=3? In this case, rTPR=3∙(1/70) for a perfect choice – a comfortable requirement, close to no ability. But now with one mistake we have rTPR=3∙(17/70)=51/70. This is about two thirds of the way between no ability (17/70) and perfect ability (70/70 – remember we need to consider the cumulative probability of one or no mistake, which under perfect ability is 0%+100%). We may or may not feel comfortable with such a high level, but if we do then we must conclude that there is clear and convincing evidence that the lady has some ability, despite her mistake.

For illustration purposes, let’s push this argument to the limit and ask: what if we lower the standard of proof all the way down to 50%, i.e. rPO=1? In this case, all we would need in order to grant the lady some ability is a preponderance of evidence. This comfortably covers one mistake, and may even allow for two mistakes, as long as we accept rTPR=53/70 (notice there are 6×6=36 ways to choose 2 right and 2 wrong cups, so the cumulative probability of two or fewer mistakes under no ability is (1+16+36)/70=53/70).
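
The three standards of proof can be compared in one sweep, using the cumulative FPR for zero, one and two mistakes (a sketch; helper names are ours):

```python
from fractions import Fraction
from math import comb

TOTAL = comb(8, 4)  # 70 ways to pick 4 cups out of 8

def cumulative_fpr(mistakes):
    # Probability, under no ability, of at most this many mistakes.
    return Fraction(sum(comb(4, m) ** 2 for m in range(mistakes + 1)), TOTAL)

# rPO = 19, 3, 1 correspond to required PP of 95%, 75%, 50%.
for rpo in (19, 3, 1):
    for mistakes in range(3):
        rtpr = rpo * cumulative_fpr(mistakes)
        verdict = 'attainable' if rtpr <= 1 else 'impossible'
        print(rpo, mistakes, rtpr, verdict)
```

At rPO=19 only the perfect choice survives; at rPO=3 one mistake is still admissible (rTPR=51/70); at rPO=1 even two mistakes pass, at rTPR=53/70.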

This may well be too lenient. But the point is that, as soon as we explicitly relate FPR to TPR, we are able to place Fisher’s significance criterion in a proper context, where his 5% threshold is not a categorical standard but one choice within a spectrum of options. In fact, once viewed in this light, we can see where Fisher’s criterion comes from.

Fisher focused on the probability of the evidence, given the null hypothesis, stating that a probability of less than 5% was small enough to be comfortable that the evidence was ‘significant’ and not a chance event. But why did he then proceed to infer that such significant evidence disproved the null hypothesis? That is: why did he conclude that the probability of the null hypothesis, given significant evidence, was small enough to disprove it? As we know (very well by now!), the probability of E given H is not the same as the probability of H given E. Why did Fisher seem to commit the Inverse Fallacy?

To answer this question, remember that the two probabilities are the same under two conditions: symmetric evidence and prior indifference. Under error symmetry, FPR=FNR. Hence, in our framework, where the hypothesis of no ability is the alternative to the tested hypothesis of some ability, FPR=5% implies FNR=5%, i.e. TPR=1-FNR=95%, and therefore PO=19 and PP=TPR=95%. The result is the same in Fisher’s framework, where the two hypotheses are – unnecessarily and confusingly – flipped around, FPR becomes FNR and the null hypothesis is rejected if FNR is less than 5%.
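
Under these two conditions the numbers line up exactly, as a minimal check shows:

```python
from fractions import Fraction

fpr = Fraction(5, 100)  # Fisher's 5% threshold
tpr = 1 - fpr           # error symmetry: FNR = FPR, so TPR = 1 - FPR
po = tpr / fpr          # posterior odds under prior indifference: PO = LR = TPR/FPR
pp = po / (1 + po)      # posterior probability: PP = PO/(1+PO)
print(po, pp)           # PO = 19, PP = 19/20 = TPR = 95%
```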

Since Fisher could not quantify TPR, he avoided any explicit consideration of FNR=1-TPR and its relation with FPR – symmetric or otherwise. But his rejection of the null hypothesis required it: in the same way as we should avoid the Confirmation Bias – accepting a hypothesis based on a high TPR, without relating it to its associated FPR – we need to avoid Fisher’s Bias: accepting a hypothesis – or, as Fisher would have it, disproving a null hypothesis – based on a low FPR, without relating it to its associated TPR.

What level of TPR did Fisher associate with his 5% FPR threshold? We don’t know – and probably neither did he. All he said was that a p-value of less than 5% was low enough to disprove the null hypothesis. Since then, Fisher’s Bias has been a source of immeasurable confusion. Assuming symmetry, FPR<5% has been commonly taken to imply PP>95%: ‘We have tested our theory and found it significant at the 5% level: therefore, there is only a 5% probability that we are wrong.’

Fisher would have cringed at such a statement. But his emphasis on significance inadvertently encouraged it. Evidence is not significant or insignificant according to whether FPR is below or above 5%. It is confirmative or disconfirmative according to whether LR is above or below 1, i.e. TPR is above or below FPR. What matters is not the level of FPR per se, but its relation to TPR. Confirmative evidence increases the probability that the hypothesis of interest is true, and disconfirmative evidence decreases it. We accept the hypothesis of interest if we have enough evidence to move LR beyond the threshold required by our standard of proof. Only then can we call such evidence ‘significant’. So, if our threshold is 95%, then, under error symmetry and prior indifference, FPR<5% implies TPR=PP>95%. There is no fallacy: TPR – the probability of E given H – is equal to PP – the probability of H given E. And 5% significance does mean that we have enough confirmative evidence to decide that the hypothesis of interest has indeed been proven beyond reasonable doubt.

This is where Fisher’s 5% criterion comes from: Bayes’ Theorem under error symmetry and prior indifference, with a 95% standard of proof. Fisher ignored TPR, because he could not quantify it. But TPR cannot be ignored – or rather: it is there, whether one ignores it or not. Fisher’s criterion implicitly assumes that TPR is at least 95%. Without this assumption, a 5% ‘significance’ level cannot be used to accept the hypothesis of interest. Just like the Confirmation Bias consists in ignoring that a high TPR means nothing without a correspondingly low FPR, Fisher’s Bias consists in ignoring that a low FPR means nothing without a correspondingly high TPR.