Fisher’s 5% significance criterion can be derived from Bayes’ Theorem under error symmetry and prior indifference, with a 95% standard of proof.
Let’s now ask: What happens if we change any of these conditions?
We have already seen the effect of relaxing the standard of proof. While, with a 95% standard, TPR needs to be at least 19 times FPR, with a 75% standard 3 times will suffice. Obviously, the lower our standard of proof the more tolerant we are towards accepting the hypothesis of interest.
Error symmetry is an incidental, non-necessary condition. What matters for significance is that TPR is at least rPO times FPR. So, with rPO=19 it just happens that rTPR=19∙5%=95%=PP. But, for example, with FPR=4% any TPR above 76% – i.e. any FNR below 24% – would do (not, however, with FPR=6%, which would require TPR>1).
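The arithmetic in this paragraph can be checked with a short sketch. It assumes the relation used throughout, rTPR=rPO∙FPR/BO with rPO=PP/(1-PP); the function names `required_odds` and `required_tpr` are illustrative, not part of the original text.

```python
from fractions import Fraction

def required_odds(standard_of_proof):
    """Posterior odds needed to meet a standard of proof PP: rPO = PP / (1 - PP)."""
    return standard_of_proof / (1 - standard_of_proof)

def required_tpr(fpr, standard_of_proof, base_rate=Fraction(1, 2)):
    """Smallest TPR for which PO = (TPR / FPR) * BO reaches rPO."""
    prior_odds = base_rate / (1 - base_rate)  # BO = 1 under prior indifference
    return required_odds(standard_of_proof) * fpr / prior_odds

# Exact fractions avoid floating-point noise in the comparisons with 1.
standard = Fraction(95, 100)
for fpr in (Fraction(4, 100), Fraction(5, 100), Fraction(6, 100)):
    needed = required_tpr(fpr, standard)
    verdict = "attainable" if needed <= 1 else "impossible (TPR would exceed 1)"
    print(f"FPR={float(fpr):.0%}: rTPR={float(needed):.0%} -> {verdict}")
```

With a 95% standard this gives a required TPR of 76%, 95% and 114% for FPR of 4%, 5% and 6% respectively, which is why 5% is the largest FPR that can still be significant under prior indifference.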
Prior indifference, on the other hand, is crucially important. Remember we assumed it at the start of our tea-tasting story, when we gave the lady a 50/50 initial chance that she might be right: BR=50%. Hence BO=1 and Bayes’ Theorem in odds form simplifies to PO=LR. Now it is time to ask: does prior indifference make sense?
Fisher didn’t think so. According to his daughter’s biography, his initial reaction to the lady’s claim that she could spot the difference between the two kinds of tea was: ‘Nonsense. Surely it makes no difference’. Such scepticism would call for a lower BO – not as low as zero, in deference to Cromwell’s rule, but clearly much lower than 1. But Fisher would have adamantly opposed it. For him there was no other value for BO but 1. This was not out of polite indulgence towards the lady, but because of his lifelong credo: ‘I shall not assume the truth of Bayes’ axiom’ (The Design of Experiments, p. 6).
Such a stalwart stance was based on ‘three considerations’ (pp. 6-7):
1. Probability is ‘an objective quantity measured by observable frequencies’. As such, it cannot be used for ‘measuring merely psychological tendencies, theorems respecting which are useless for scientific purposes.’
This is the seemingly interminable and incredibly vacuous dispute between the objective and the subjective interpretations of probability. It is the same reason why Fisher ignored TPR: if it cannot be quantified, one might as well omit it. He was plainly wrong. As Albert Einstein did not say: ‘Not everything that can be counted counts, and not everything that counts can be counted.’ Science is the interpretation of evidence arranged into explanations. It relies on hard as well as soft evidence. Hard evidence is the result of a controlled, replicable experiment, generating objective, measurable probabilities grounded on empirical frequencies. Soft evidence is everything else: any sign that can help the observer’s effort to evaluate whether a hypothesis is true or false. Such effort arises from a primal need that long predates any theory of probability. The interpretation of soft evidence is inherently subjective. But, contrary to Fisher’s view, there is nothing unscientific about it: subjective probability can be laid out as a complete and coherent theory.
2. ‘My second reason is that it is the nature of an axiom that its truth should be apparent to any rational mind which fully apprehends its meaning. The axiom of Bayes has certainly been fully apprehended by a good many rational minds, including that of its author, without carrying this conviction of necessary truth. This, alone, shows that it cannot be accepted as the axiomatic basis of a rigorous argument.’
This is downright bizarre. First, Bayes’ is a theorem, not an axiom. Second, it is a straightforward consequence of two straightforward definitions. Hence, it is obviously and necessarily true. But somehow Fisher didn’t see it this way. He even believed that Reverend Bayes himself was not completely convinced about it, and that was the reason why he left his Essay unpublished. He had no evidence to support this claim – it was just a prior belief!
3. ‘My third reason is that inverse probability has been only very rarely used in the justification of conclusions from experimental facts’.
This is up there with the Decca executive’s rejection of the Beatles: ‘We don’t like their sounds. Groups of guitars are on the way out’.
Fisher’s credo was embarrassingly wrong. But it doesn’t matter: whether one believes it or not, we are all Bayesian. We all have priors. Ignoring them only means that we are inadvertently assuming prior indifference: BR=50% and BO=1, with all its potentially misleading consequences. Rather than pretending they do not exist, we should try to get our priors right.
So let’s go back to Fisher’s sceptical reception of the lady’s claim. We might at first interpret his prior indifference as neutral open-mindedness, expressing perfect ignorance. Maybe the lady is skilled, maybe not – we just don’t know: let’s give her a 50/50 chance and let the data decide. But hang on. Would we say the same and use the same amount of data if the claim had been much more ordinary – e.g. spotting sugar in the tea, or distinguishing between Darjeeling and Earl Grey? And what would we do, on the other hand, with a truly outlandish claim – e.g. spotting whether the tea contains one or two grains of sugar, or whether it has been prepared by a right-handed or a left-handed person? Surely, our priors would differ and we would require much more evidence to test the extraordinary claim than the ordinary one: extraordinary claims require extraordinary evidence.
Prior indifference may be appropriate for a fairly ordinary claim. But the more extraordinary the claim, the lower our prior belief should be and, therefore, the larger the amount of confirmative evidence required to satisfy a given standard of proof. For example, let’s share Fisher’s scepticism and halve BR to 25%, hence BO=1/3. Now, with a 95% standard of proof, we have rTPR=3∙19∙(1/70)≈81% in the case of a perfect choice: it is three times as much as under prior indifference, but since it is still below 1 we can accept the hypothesis. More so, obviously, if we relax the standard of proof to 75%. But notice that, with the 75% standard and one mistake, we now have rTPR=3∙3∙(17/70)≈2.19, which is higher than 1. So, while with prior indifference one mistake would still be clear and convincing evidence of some ability (rTPR=3∙(17/70)≈73%), starting with a sceptical prior would lead us to a rejection – coinciding again with Fisher’s conclusion. Did Fisher have an indifferent prior and a 95% standard of proof, or a sceptical prior and a 75% standard? Neither, if we asked him: he shunned priors. But in reality it is either (and – given his scepticism – more likely the latter): we are all Bayesian. In fact, Fisher’s 5% threshold is compatible with various combinations of priors and standards of proof. For instance, BR=25% and an 85% standard give rTPR=(17/3)∙3∙5%=85%, where again, incidentally, PP=TPR.
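The figures in this paragraph can be reproduced with a minimal sketch. It takes the 8-cup probabilities quoted in the text (1/70 for a perfect choice, 17/70 for at most one mistake) as given, and the helper names `odds` and `required_tpr` are illustrative assumptions.

```python
from fractions import Fraction

def odds(probability):
    """Convert a probability into odds: p / (1 - p)."""
    return probability / (1 - probability)

def required_tpr(fpr, standard_of_proof, base_rate):
    """rTPR = rPO * FPR / BO, from PO = LR * BO with LR = TPR / FPR."""
    return odds(standard_of_proof) * fpr / odds(base_rate)

indifferent, sceptical = Fraction(1, 2), Fraction(1, 4)   # BR = 50% and 25%
perfect, one_mistake = Fraction(1, 70), Fraction(17, 70)  # FPR values from the text

cases = [
    ("perfect, sceptical, 95%", perfect, Fraction(95, 100), sceptical),
    ("one mistake, indifferent, 75%", one_mistake, Fraction(75, 100), indifferent),
    ("one mistake, sceptical, 75%", one_mistake, Fraction(75, 100), sceptical),
    ("FPR=5%, sceptical, 85%", Fraction(5, 100), Fraction(85, 100), sceptical),
]
for label, fpr, standard, base_rate in cases:
    needed = required_tpr(fpr, standard, base_rate)
    decision = "accept" if needed <= 1 else "reject"
    print(f"{label}: rTPR={needed} ({float(needed):.2f}) -> {decision}")
```

The last case is the one where the required TPR coincides with the standard of proof: both come out at 17/20.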
(Note: Prior indifference and error symmetry are sufficient but not necessary conditions for PP=TPR. The necessary condition is BO=FPR/FNR).
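The condition in the note follows in a few lines from the definitions already in play (PO=LR∙BO, LR=TPR/FPR, FNR=1-TPR). The sketch below assumes that TPR is the required value at which the standard of proof is exactly met, so that the posterior odds equal PP/(1-PP).

```latex
\begin{align*}
\mathrm{PP} = \mathrm{TPR}
  \;&\Longleftrightarrow\;
  \mathrm{PO} = \frac{\mathrm{PP}}{1-\mathrm{PP}}
              = \frac{\mathrm{TPR}}{1-\mathrm{TPR}}
              = \frac{\mathrm{TPR}}{\mathrm{FNR}} \\[4pt]
\mathrm{PO} &= \mathrm{LR}\cdot\mathrm{BO}
             = \frac{\mathrm{TPR}}{\mathrm{FPR}}\cdot\mathrm{BO} \\[4pt]
\frac{\mathrm{TPR}}{\mathrm{FNR}} = \frac{\mathrm{TPR}}{\mathrm{FPR}}\cdot\mathrm{BO}
  \;&\Longrightarrow\;
  \mathrm{BO} = \frac{\mathrm{FPR}}{\mathrm{FNR}}
\end{align*}
```

With BO=1 this reduces to FPR=FNR, i.e. error symmetry. With BO=1/3 and FPR=5% it gives FNR=15%, hence TPR=85%, which matches the 85% standard in the example above.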
Finally, let’s see what happens as we gather more evidence. Remember Fisher ran the experiment with 8 cups. If the lady made no mistake, he accepted the hypothesis that she had some ability; with one or more mistakes, he rejected it. Lowering the standard of proof would tolerate one mistake. But a lower prior would again mean rejection. We can however hear the lady’s protest: Come on, that was one silly mistake – I got distracted for a moment. Give me 4 more cups and I will show you: no more mistakes. As Bayesians, we consent: we allow new evidence to change our mind.
So let’s re-run the experiment with 12 tea cups (The Design of Experiments, p. 21). Now the probability of making no mistake by pure chance goes down to 1/924 (as 12!/[6!(12-6)!]=924), the probability of one mistake to 36/924 (as there are 6×6 ways to choose 5 right cups and 1 wrong cup), and the probability of one or no mistake to 37/924. Obviously, with no mistakes on 12 cups, the lady’s ability is even more apparent. But now, under prior indifference, one mistake satisfies Fisher’s 5% criterion, as 37/924≈4%, and is compatible with a 95% standard of proof, as rTPR=19∙(37/924)≈76% is lower than 1. Hence we accept the hypothesis even if the lady makes one mistake. Not so, however, if we start with a sceptical prior: for that we need a 75% standard. Or we need to extend the experiment to 14 cups (by now you know what to do).
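For readers who want to check the 12-cup and 14-cup cases, here is a minimal sketch. It assumes the design described in the text: half the cups are of each kind, the lady picks half of them, and a 'mistake' is one wrong cup among her picks. The names `guess_probability` and `accepts` are illustrative.

```python
from fractions import Fraction
from math import comb

def guess_probability(cups, max_mistakes):
    """Chance of at most `max_mistakes` wrong picks when guessing at random.

    With m mistakes there are C(half, half - m) * C(half, m) = C(half, m)**2
    ways to pick the cups, out of C(cups, half) possible selections.
    """
    half = cups // 2
    favourable = sum(comb(half, m) ** 2 for m in range(max_mistakes + 1))
    return Fraction(favourable, comb(cups, half))

def accepts(fpr, standard_of_proof, base_rate):
    """Accept when the required TPR, rPO * FPR / BO, does not exceed 1."""
    required_odds = standard_of_proof / (1 - standard_of_proof)
    prior_odds = base_rate / (1 - base_rate)
    return required_odds * fpr / prior_odds <= 1

for cups in (8, 12, 14):
    fpr = guess_probability(cups, 1)  # one mistake or none
    for base_rate in (Fraction(1, 2), Fraction(1, 4)):
        ok = accepts(fpr, Fraction(95, 100), base_rate)
        print(f"{cups} cups, BR={float(base_rate):.0%}: "
              f"FPR={fpr} -> {'accept' if ok else 'reject'} at a 95% standard")
```

With 14 cups there are 3432 possible selections and 50 of them contain at most one mistake, so even the sceptical prior clears the 95% standard.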
To summarise (A=Accept, R=Reject):
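One way to tabulate these decisions is to recompute them from the rule used throughout, accepting when rPO∙FPR/BO does not exceed 1. The sketch below is an illustration under that assumption; the grid of cases (8 or 12 cups, zero or one mistake, BR of 50% or 25%, standard of 95% or 75%) is chosen here to cover the scenarios discussed in the text, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb

def guess_probability(cups, max_mistakes):
    """Chance of at most `max_mistakes` wrong picks when guessing at random."""
    half = cups // 2
    favourable = sum(comb(half, m) ** 2 for m in range(max_mistakes + 1))
    return Fraction(favourable, comb(cups, half))

def decision(cups, mistakes, base_rate, standard_of_proof):
    """Return 'A' (accept) if the required TPR is at most 1, else 'R' (reject)."""
    fpr = guess_probability(cups, mistakes)
    required_odds = standard_of_proof / (1 - standard_of_proof)
    prior_odds = base_rate / (1 - base_rate)
    return "A" if required_odds * fpr / prior_odds <= 1 else "R"

def summary():
    """Decisions for every combination of cups, mistakes, prior and standard."""
    grid = product(
        (8, 12),                                # number of cups
        (0, 1),                                 # mistakes tolerated
        (Fraction(1, 2), Fraction(1, 4)),       # indifferent and sceptical BR
        (Fraction(95, 100), Fraction(75, 100)), # standards of proof
    )
    return {case: decision(*case) for case in grid}

for (cups, mistakes, base_rate, standard), verdict in summary().items():
    print(f"{cups:>2} cups, {mistakes} mistake(s), "
          f"BR={float(base_rate):.0%}, standard={float(standard):.0%}: {verdict}")
```

The only rejections in this grid are with one mistake: on 8 cups unless the prior is indifferent and the standard is 75%, and on 12 cups when a sceptical prior meets a 95% standard.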
There is more to testing a hypothesis than 5% significance. The decision to accept or reject it is the result of a fine balance between standard of proof, evidence and priors: PO=LR∙BO.