Back to PO=LR∙BO.

Whether we accept or reject a hypothesis, i.e. decide whether a claim is true or false, depends on all three elements.

Posterior Odds. The minimum standard of proof required to accept a hypothesis is PO>1 (i.e. PP>50%). We call it *Preponderance of evidence*. But, depending on the circumstances, this may not be enough. We have seen two other cases: *Clear and convincing evidence*: PO>3 (i.e. PP>75%), and *Evidence beyond reasonable doubt*: PO>19 (PP>95%), to which we can add *Evidence beyond the shadow of a doubt*: PO>99 (PP>99%) or even PO>999 (PP>99.9%). The spectrum is continuous, from 1 to infinity, where *Certainty* (PP=100%) is unattainable and is therefore a *decision*. The same is symmetrically true for rejecting the hypothesis, from PO<1 to the other side of Certainty: PO=PP=0.
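
The correspondence between odds and probabilities used throughout (PO=PP/(1-PP) and PP=PO/(1+PO)) can be checked with a minimal sketch. The function names `odds` and `prob` are illustrative choices, not part of the original text, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction

def odds(p):
    """Convert a probability p (0 <= p < 1) into odds p / (1 - p)."""
    p = Fraction(p)
    return p / (1 - p)

def prob(o):
    """Convert odds o >= 0 back into a probability o / (1 + o)."""
    o = Fraction(o)
    return o / (1 + o)

# Standards of proof named in the text, expressed as posterior probabilities.
standards = {
    "Preponderance of evidence": Fraction(1, 2),
    "Clear and convincing evidence": Fraction(3, 4),
    "Beyond reasonable doubt": Fraction(19, 20),
    "Beyond the shadow of a doubt": Fraction(99, 100),
}

for name, pp in standards.items():
    print(f"{name}: PP > {float(pp):.0%}, PO > {odds(pp)}")
```

The same two functions apply to the prior side: a base rate of 25% corresponds to base odds of 1/3.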

Base Odds. To reach the required odds we have to start somewhere. A common starting point is *Prior indifference*, or Perfect ignorance: BO=1 (BR=50%). But, depending on the circumstances, this may not be a good starting place. With BO=1 it looks like Base Odds have disappeared, but they haven’t: they are just being ignored – which is never a good start. Like PO, Base Odds are on a continuous spectrum between the two boundaries of *Faith*: BR=100% and BR=BO=0. Depending on BO, we need more or less evidence in order to achieve our required PO.

Likelihood Ratio. Evidence is *confirmative* if LR>1, i.e. TPR>FPR, and *disconfirmative* if LR<1, i.e. TPR<FPR. The sizes of TPR and FPR are not relevant *per se* – what matters is their ratio. A high TPR means nothing without a correspondingly low FPR. Ignoring this leads to the Confirmation Bias. Likewise, a low FPR means nothing without a correspondingly high TPR. Ignoring this leads to Fisher’s Bias.

To test a hypothesis, we start with a BO level that best reflects our priors and set our required standard of proof PO. The ratio of PO to BO determines the required LR: the strength or weight of the evidence we demand to accept the hypothesis. In our tea-tasting story, for example, we have BO=1 (BR=50%) and PO>19 (PP>95%), giving LR>19: in order to accept the hypothesis that the lady has some tea-tasting ability, we require evidence that is at least 19 times more likely if the hypothesis is true than if the hypothesis is false. A test is designed to calculate FPR: the probability that the evidence is a product of chance. This requires defining a random variable and assigning to it a probability distribution. Our example is an instance of what is known as Fisher’s exact test, where the random variable is the number of successes over the number of trials without replacement, as described by the hypergeometric distribution. Remember that with 8 trials the probability of a perfect choice under the null hypothesis of no ability is 1/70, the probability of 3 successes and 1 failure is 16/70, and so on. FPR is the probability of doing at least that well by chance: 1/70 for a perfect choice, and 1/70+16/70=17/70 for 3 successes and 1 failure. Hence, in case of a perfect choice we accept the hypothesis that the lady has some ability if TPR>19∙(1/70)=27% – a very reasonable requirement. But with 3 successes and 1 failure we would require an impossible TPR>19∙(17/70)=461%. On the other hand, if we lower our required PO to 3 (PP>75%), then all we need is TPR>3∙(17/70)=73% – a high but feasible requirement. But if we lower our BO to a more sceptical level, e.g. BO=1/3 (BR=25%), then TPR>3∙3∙(17/70)=219% is again too high, whereas a perfect choice may still be acceptable evidence of some ability, even with the higher PO: TPR>3∙19∙(1/70)=81%.
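
The arithmetic in this paragraph can be reproduced with a short sketch. It assumes the usual set-up of the tea test (an even number of cups, half of each kind, and the lady picking exactly half of them) and takes FPR to be the chance of doing at least as well as observed, which is where 17/70 = 16/70 + 1/70 appears to come from. The names `tail_prob` and `required_tpr` are hypothetical helpers, not taken from the original.

```python
from fractions import Fraction
from math import comb

def tail_prob(cups, correct):
    """FPR: chance of at least `correct` right picks out of cups/2, with no ability.

    Assumes an even number of cups, half of each kind, and exactly cups/2
    picks (the hypergeometric set-up of Fisher's exact test).
    """
    half = cups // 2
    ways = sum(comb(half, k) * comb(half, half - k)
               for k in range(correct, half + 1))
    return Fraction(ways, comb(cups, half))

def required_tpr(pp, br, fpr):
    """Minimum TPR implied by PO = LR * BO, i.e. TPR > (PO / BO) * FPR."""
    po = pp / (1 - pp)   # required posterior odds
    bo = br / (1 - br)   # base odds
    return po / bo * fpr

perfect = tail_prob(8, 4)    # 4 right out of 4
one_miss = tail_prob(8, 3)   # at least 3 right out of 4

even, sceptic = Fraction(1, 2), Fraction(1, 4)
p95, p75 = Fraction(19, 20), Fraction(3, 4)

cases = [
    ("perfect, BR=50%, PP>95%", p95, even, perfect),
    ("one miss, BR=50%, PP>95%", p95, even, one_miss),
    ("one miss, BR=50%, PP>75%", p75, even, one_miss),
    ("one miss, BR=25%, PP>75%", p75, sceptic, one_miss),
    ("perfect, BR=25%, PP>95%", p95, sceptic, perfect),
]
for label, pp, br, fpr in cases:
    # A required TPR above 100% means the evidence cannot meet the standard.
    print(f"{label}: TPR > {float(required_tpr(pp, br, fpr)):.0%}")
```

Any requirement above 100% signals that no evidence of that kind, however powerful, can meet the chosen standard from the chosen prior.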

So there are four variables: PP, BR, FPR and TPR. Of these, PP is set by our required standard of proof, BR by our prior beliefs and FPR by the probability distribution of the relevant random variable. These three combined give us the minimum level of TPR required to accept the hypothesis of interest. TPR – the probability of the evidence in case the hypothesis is true – is also known as statistical power, or sensitivity. Our question is: given our starting priors and required standard of proof, and given the probability that the evidence is a chance event, how *powerful* should the evidence be for us to decide that it is *not* a chance event but a sign that the hypothesis of interest is true?

Clearly, the lower the FPR, the more inclined we are to accept. As we know, Fisher would do it if FPR<5% – awkwardly preferring to declare himself able to disprove the null hypothesis at that significance level. That was enough for him: he apparently took no notice of the other three variables. But, as we have seen, what he might have been doing was implicitly assuming error symmetry, prior indifference and setting PP beyond reasonable doubt, thus requiring TPR>95%, i.e. LR>19. Or, more likely, at least in the tea test, he was starting from a more sceptical prior (e.g. 25%), while at the same time lowering his standard of proof to e.g. 75%, which at FPR=5% requires TPR>45%, i.e. LR>9, or perhaps to 85%, which requires TPR>85%, i.e. LR>17.

There are many combinations of the four variables that are consistent with the acceptance of the hypothesis of interest. To see it graphically, let’s fix FPR: imagine we just ran a test and calculated that the probability that the resulting evidence is the product of chance is 5%. Do we accept the hypothesis? Yes, says Fisher. But we say: it depends on our priors and standard of proof. Here is the picture:

For each BR, TPR is a positively convex function of PP. For example, with prior indifference (BR=50%) and a minimum standard of proof (PP>50%) all we need to accept the hypothesis is TPR>5% (i.e. LR>1): the hypothesis is more likely to be true than false. But with a higher standard, e.g. PP>75%, we require TPR>15% (LR>3), and with PP>95% we need TPR>95% (LR>19). The requirement gets steeper with a sceptical prior. For instance, halving BR to 25% we need TPR>15% for a minimum standard and TPR>45% for PP>75%. But PP>95% would require TPR>1: evidence is not powerful enough for us to accept the hypothesis beyond reasonable doubt. For each BR, the maximum standard of proof that keeps TPR below 1 is BO/(BO+FPR). Under prior indifference, that is 95% (95.24% to be precise: PO=20), but with BR=25% it is 87%. The flat roof area in the figure indicates the combination of priors and standards of proof which is incompatible with accepting the hypothesis at the 5% FPR level.
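
The numbers in this paragraph can be tabulated with a small sketch, offered as an illustration rather than as the code behind the figure. It fixes FPR at 5%, computes the required TPR for a few priors and standards of proof, and reports the ceiling BO/(BO+FPR). The helper names `required_tpr` and `max_standard` are assumptions made for this example.

```python
def required_tpr(pp, br, fpr=0.05):
    """Required TPR = (PO / BO) * FPR; returns None on the 'flat roof' (above 1)."""
    po = pp / (1 - pp)
    bo = br / (1 - br)
    tpr = po / bo * fpr
    return tpr if tpr <= 1 else None

def max_standard(br, fpr=0.05):
    """Highest PP reachable with TPR = 1, i.e. BO / (BO + FPR)."""
    bo = br / (1 - br)
    return bo / (bo + fpr)

for br in (0.50, 0.25):
    cells = []
    for pp in (0.50, 0.75, 0.95):
        tpr = required_tpr(pp, br)
        cells.append(f"PP>{pp:.0%}: " + ("roof" if tpr is None else f"TPR>{tpr:.0%}"))
    print(f"BR={br:.0%} | " + " | ".join(cells)
          + f" | max PP={max_standard(br):.2%}")
```

Changing the `fpr` argument shows how the roof grows or shrinks for other levels of FPR.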

If TPR is on or above the curved surface, we accept the hypothesis. But, unlike Fisher, if it is below we reject it: despite a 5% FPR, the evidence is not powerful enough for our priors and standard of proof. Remember we don’t need to calculate TPR precisely. If, as in the tea-tasting story, the hypothesis of interest is vague – the lady has some unspecified ability – it might not be possible. But what we can do is to assess whether TPR is above or below the required level. If we are prior indifferent and want to be certain beyond reasonable doubt that the hypothesis is true, we need TPR>95%. But if we are happy with a lower 75% standard then all we need is TPR>15%. If on the other hand we have a sceptical 25% prior, there is no way we can be certain beyond reasonable doubt, while with a 75% standard we require TPR>45%.

It makes no sense to talk about significance, acceptance and rejection without first specifying priors and standards of proof. In particular, very low priors and very high standards land us on the flat roof corner of the surface, making it impossible for us to accept the hypothesis. This may be just fine – there are many hypotheses that I am too sceptical and too demanding to be able to accept. At the same time, however, I want to keep an open mind. But doing so does not mean reneging on my scepticism or compromising my standards of proof. It means looking for more compelling evidence with a lower FPR. That’s what we did when the lady made one mistake with 8 cups. We extended the trial to 12 cups and, under prior indifference and with no additional mistakes, we accepted her ability beyond reasonable doubt. Whereas, starting with sceptical priors, acceptance required lowering the standard of proof to 75% or extending the trial to 14 cups.

To reduce the size of the roof we need evidence that is less likely to be a chance event. For instance, FPR=1% shrinks the roof to a small corner, where even a sceptical 25% prior allows acceptance up to PP=97%. In the limit, as FPR tends to zero – there is no chance that the evidence is a random event – we *have* to accept. Think of the lady gulping 100 cups in a row and spotlessly sorting them into the two camps: even the most sceptical statistician would regard this, without a shadow of a doubt, as *conclusive* positive evidence (a Smoking Gun). On the other hand, coarser evidence, more probably compatible with a chance event, enlarges the roof, thus making acceptance harder. With FPR=10%, for instance, the maximum standard of proof under prior indifference is 91%, meaning that even the most powerful evidence (TPR=1) is not enough for us to accept the hypothesis beyond reasonable doubt. And with a sceptical 25% prior the limit is 77%, barely above the ‘clear and convincing’ level. While harder, however, acceptance is not ruled out, as it would be with Fisher’s 5% criterion. According to Fisher, a 1 in 10 probability that the evidence is the product of chance is just too high for comfort, making him unable to reject the null hypothesis. But what if TPR is very high, say 90%? In that case, LR=9: the evidence is 9 times more likely if the hypothesis is true than if it is false and, under prior indifference, PP=90%. Sure, we can’t be certain beyond reasonable doubt that the hypothesis is true, but in many circumstances it would be eminently reasonable to accept it.
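
As a rough check on these figures, the sketch below runs PO=LR∙BO forwards: given TPR, FPR and BR, it returns the posterior probability. Setting TPR=1 gives the highest standard of proof that a given FPR can support. The function name `posterior` is an illustrative assumption, and FPR must be strictly positive for the ratio to be defined.

```python
def posterior(tpr, fpr, br):
    """Posterior probability from PO = LR * BO, with LR = TPR / FPR (fpr > 0)."""
    bo = br / (1 - br)        # base odds
    po = (tpr / fpr) * bo     # posterior odds
    return po / (1 + po)

# The case discussed in the text: TPR=90%, FPR=10%, prior indifference.
print(f"TPR=90%, FPR=10%, BR=50%: PP={posterior(0.90, 0.10, 0.50):.0%}")

# Ceiling on the standard of proof: the most powerful evidence has TPR=1.
for fpr in (0.10, 0.05, 0.01):
    for br in (0.50, 0.25):
        print(f"FPR={fpr:.0%}, BR={br:.0%}: max PP={posterior(1.0, fpr, br):.1%}")
```

As FPR shrinks towards zero the ceiling approaches 100% for any non-zero prior, which is the Smoking Gun limit described above.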

* Also published on Medium. *