Jul 30, 2017

Investment risk is the probability of a substantial and permanent loss of capital. We buy a stock at 100 expecting to earn a return, consisting of appreciation and possibly a stream of dividends. But our expectation may be disappointed: the price may go down rather than up and we may decide to sell the stock at a loss, either because we need the money or because we come to realise, rightly or wrongly, that we made a mistake and the stock will never reach our expected level.

How does investment risk relate to volatility – the standard deviation of past returns, measuring the extent to which returns have been fluctuating and vibrating around their mean? Clearly, we prefer appreciation to be as quick and smooth as possible. If our expected price level is, say, 150, we would like the stock to reach the target in a straight line rather than through a tortuous rollercoaster. On the other hand, if we are confident that the price will get there eventually, we – unimpressionable grownups – may well endure the volatility. In fact, if on its way to 150 the price dropped to 70 it would create an inviting opportunity to buy more.

Volatility increases investment risk only insofar as it manages to undermine our confidence. We might have rightly believed that Amazon was a great investment at 85 dollars in November 1999, but by the time it reached 6 two years later our conviction would have been brutally battered. Was there any indication at the time that the stock could have had such a precipitous drop? Sure, the price had been gyrating wildly until then, up 21% in November, down 12% in October, up 29% in September and 24% in August, down 20% in July, and so on. The standard deviation of monthly returns since the IPO had been 33%, compared to 5% for the S&P500, suggesting that further and possibly more extreme gyrations were to be expected. But to a confident investor that only meant: tighten your seatbelt and enjoy the ride. A 93% nosedive, however, was something else – more than enough to break the steeliest nerves and crush the most assured resolve. ‘I must be wrong, I’m out of here’ is an all too human reaction in such circumstances.

Therefore, while volatility may well contribute to raising investment risk, it is not the same as investment risk. It is only when – rightly or wrongly – conviction is overwhelmed by doubt and poise surrenders to anxiety that investment risk bears its bitter fruit.

Amazon is a dramatic example, but this is true in general. Every investment is made in the expectation of making a return, together with a more or less conscious and explicit awareness that it may turn out to be a flop. Every investor knows this, in practice. So why do many of them ignore it in theory and keep using financial models built on the axiom that volatility equals investment risk? As we have seen, the reason is the intellectual dominance of the Efficient Market Theory.

So the next question is: Why is it that, according to the EMT, investment risk coincides with volatility? The answer is as simple as it is unappreciated. Let’s see.

If the EMT could be summarised in one sentence, it would be: The market price is right. Prices are always where they should be. Amazon at 85, 6 or 1000 dollars. The Nasdaq at 5000, 1400 or 6400. At each point in time, prices incorporate all available information about expected profits, returns and discount rates. Prices are never too high or too low, except with hindsight. Therefore, an investor who buys a stock at 100 because he thinks it is worth 150 is fooling himself. Nobody can beat the market. If the market is pricing the stock at 100, then that’s what it’s worth. The price will change if and only if new information – unknown and unknowable beforehand and therefore not yet incorporated into the current price – prompts the market to revise its valuation. As this was true in the past, so it is true in the present and in the future: past price changes must also have been caused by no other reason than the arrival of information that was new at the time and unknown until then. Thus all price changes are unknowable and, by definition, unexpected. And since price changes are the largest component of returns – the other being dividends, which can typically be anticipated to some extent – we must conclude that past returns are largely unexpected. At this point there is only one last step: to identify risk with the unexpected. If we define investment risk as anything that could happen to the stock price that is not already incorporated into its current level, then the volatility of past returns can be taken as its accurate measure.

Identifying investment risk with volatility presupposes market efficiency. This is part of what Eugene Fama calls the joint hypothesis problem. To be an active investor, thus rejecting the EMT in practice, while at the same time using financial models based on the identification of investment risk with volatility, thus assuming the EMT in theory, is a glaring but largely unnoticed inconsistency.

So the next question is: what is it that practitioners know, which makes them behave as active investors, and that EMT academics ignore, which leads them to declare active investment an impossible waste of time and to advocate passive investment?

Again, the answer is simple but out of sight. In a nutshell: Practitioners know by ample experience that investors have different priors. EMT academics assume, by theoretical convenience, that investors have common priors.

Different priors is the overarching theme of the entire Bayes blog. People can and do reach different conclusions based on the same evidence because they interpret evidence based on different prior beliefs. This is blatantly obvious everywhere, including financial markets, where, based on the same information, some investors love Amazon and others short it. In the hyperuranian realm of the EMT, on the other hand, investors have common priors and therefore, when faced with common knowledge, cannot but reach the same conclusion. As Robert Aumann famously demonstrated, they cannot agree to disagree. This is why, in EMT parlance, prices reflect all available information.

Take the assumption away and the whole EMT edifice comes tumbling down. This is what Paul Samuelson was referring to in the final paragraphs of the Fluctuate and Vibrate papers. More explicitly, here is how Jonathan Ingersoll put it in his magisterial Theory of Financial Decision Making, immediately after ‘proving’ the EMT:

In fact, the entire “common knowledge” assumption is “hidden” in the presumption that investors have a common prior. If investors did not have a common prior, then their expectations conditional on the public information would not necessarily be the same. In other words, the public information would properly also be subscripted as φk – not because the information differs across investors, but because its interpretation does.

In this case the proof breaks down. (p. 81).

Interestingly, on a personal note, I first made the above quotation in my D.Phil. thesis (p. 132). A nice circle back to the origin of my intellectual journey.

Jul 1, 2017

As he wrote his ‘Challenge to Judgment’ in the first issue of the Journal of Portfolio Management in 1974, Paul Samuelson expected ‘the world of practical operators’ and ‘the new world of academics’ – which at the time looked to him ‘still light-years apart’ – to show some degree of convergence in the future.

On the face of it, he was right. The JPM recently celebrated its 40th anniversary. The Financial Analysts Journal, started with the same bridging intent 30 years earlier under Ben Graham’s auspices, is alive and well in its 73rd volume. Dozens of other periodicals have joined in the effort and hundreds of books and manuals have been written, sharing the purpose of promoting and developing a common language connecting the practice and the theory of investing.

But, while presuming and pretending to understand each other, the two worlds are still largely immersed in a sea of miscommunication. At the base of the Babel there are two divergent perspectives on the relationship between risk and return. Everybody understands return. You buy a stock at 100 and the price goes up to 110 – that’s a 10% return. But this is ex post. What was your expected return before you bought? And what risk did you assume? The practical operator does not have precise answers to these questions. I looked at the company – he would say – studied its business, read its balance sheet, talked to the managers, did my discounted cash flow valuation and concluded that the company was worth more than 100 per share. So I expected to earn a good return over time, roughly equal to the gap between my intrinsic value estimate and the purchase price. As for risk, I knew my valuation could be wrong – the company might be worth less than I thought. And even if I was right at the time of purchase, the company and my investment might have taken wrong turns in myriad different ways, causing me to lose some or all of my money.

Is that it? – says the academic – is that all you can say? Of course not – replies the operator – I could elaborate. But I couldn’t do it any better than Ben Graham: read his books and you’ll get all the answers.

But the academic would have none of it. As Eugene Fama recalls: ‘Without being there one can’t imagine what finance was like before formal asset pricing models. For example, at Chicago and elsewhere, investment courses were about security analysis: how to pick undervalued stocks’. (My Life in Finance, p. 14). Go figure. Typically confusing science with precision, the academic is not satisfied until he can squeeze concepts into formulas and insights into numbers. I don’t know what to do with Graham’s rhetoric – he says – I need measurement. So let me repeat my questions: what was your expected return exactly? How did you quantify your risk?

Give me a break – says the defiant operator – risk is much too complex to be reduced to a number. As for my expected return, I told you it is the gap between value and price, but I am under no illusion that I know it exactly. All I know is that the gap is large enough and I am prepared to wait until it closes.

Tut-tut – Fama shakes his head – Listen to me, you waffly retrograde. I will teach you the CAPM. ‘The CAPM provides the first precise definition of risk and how it drives expected return, until then vague and sloppy concepts’. (p. 15).

The operator listens attentively and in the end says: Sorry, I think the CAPM is wrong. First, you measure risk as the standard deviation of past returns. You do it because it gives you a number, but I think it makes little sense. Second, you say the higher the risk the higher is the expected return. That makes even less sense. My idea of risk is that the more there is the more uncertain I am about my expected return. In my view, the relationship between risk and expected return is, if anything, negative. So thank you for the lecture, but I stick with Graham. As Keynes did not say (again!): It is better to be vaguely right than precisely wrong.

Writing ten years after Samuelson’s piece, Warren Buffett well expressed the chasm between academics and practical operators: ‘Our Graham & Dodd investors, needless to say, do not discuss beta, the capital asset pricing model or covariance in returns among securities. These are not subjects of any interest to them. In fact, most of them would have difficulty defining those terms. The investors simply focus on two variables: price and value’. (Buffett, Superinvestors, p. 7).

But operators are rarely so blunt. Such is the intellectual authority of the Efficient Market Theory that the identification of risk with the standard deviation of returns – a.k.a. volatility – and the implication that more risk means higher returns are taken for granted and unthinkingly applied to all sorts of financial models. Hilariously, these include the same valuations that investment practitioners employ to justify their stock selection – an activity that makes sense only if one rejects the EMT! It is pure schizophrenia: investors unlearn at work what they learned at school, while at the same time continuing to use many of the constructs of the rejected theory and failing to notice the inconsistency.

But here is the biggest irony: after teaching it for forty years – twenty after Buffett’s piece – Fama finally got it out of his system: ‘The attraction of the CAPM is that it offers powerful and intuitively pleasing predictions about how to measure risk and the relation between expected return and risk. Unfortunately, the empirical record of the model is poor – poor enough to invalidate the way it is used in applications’ (Fama and French, JEP 2004). Hallelujah. Never mind that in the meantime the finance world – academics and practitioners – had amassed a colossal quantity of such applications and drawn an immeasurable variety of invalid conclusions. But what is truly mindboggling is that, in spite of it all, the CAPM is still regularly taught and widely applied. It is hard to disagree with Pablo Fernandez – a valiant academic whose work brings much needed clarity amidst the finance Babel – when he calls this state of affairs unethical:

If, for any reason, a person teaches that Beta and CAPM explain something and he knows that they do not explain anything, such a person is lying. To lie is not ethical. If the person “believes” that Beta and CAPM explain something, his “belief” is due to ignorance (he has not studied enough, he has not done enough calculations, he just repeats what he heard to others…). For a professor, it is not ethical to teach about a subject that he does not know enough about.

Two books that I think are particularly effective in helping operators move from practical unlearning – erratic, undigested and incoherent – to proper intellectual unlearning of the concept of risk embedded in the EMT and its derivations are David Dreman’s Contrarian Investment Strategies: the Next Generation (particularly Chapter 14: What is Risk?) and Howard Marks’ The Most Important Thing (particularly Chapters 5-7 on Understanding, Recognizing and Controlling Risk).

Besides the EMT’s predominance, unlearning is necessary because, at first glance, measuring risk with the standard deviation of returns makes intuitive sense: the more prices ‘fluctuate’ and ‘vibrate’, the higher the risk. Take Amazon:

If you had invested 30,000 dollars in Amazon’s IPO in May 1997 (it came out at 18 dollars, equivalent to 1.50 dollars after three splits), after twenty years – as the stock price reached 1,000 dollars (on 2nd June this year, to be precise) – your investment would have been worth 20 million dollars. Everybody understands return. But look at the chart – in log scale to give a graphic sense of what was going on: 1.50 went to 16 in a year (+126% in one month – June 1998) to reach 85 in November 1999. Then in less than two years – by September 2001 – it was down to 6, only to climb back to 53 at the end of 2003, down to 27 in July 2006, up to 89 in October 2007, down to 43 in November 2008 and finally up – up up up – to 1000. Who – apart from Rip van Winkle and Jeff Bezos – would have had the stomach to withstand such an infernal rollercoaster?
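For the record, the arithmetic is easy to check. Here is a minimal sketch, using only the split-adjusted prices quoted above:

```python
# Back-of-the-envelope check of the Amazon example above
# (1.50 and 1,000 dollars are the split-adjusted prices quoted in the text).
initial_investment = 30_000            # dollars invested at the May 1997 IPO
split_adjusted_ipo_price = 1.50        # 18 dollars before three splits
price_june_2017 = 1_000.0

shares = initial_investment / split_adjusted_ipo_price     # 20,000 shares
final_value = shares * price_june_2017                     # 20 million dollars
cagr = (final_value / initial_investment) ** (1 / 20) - 1  # roughly 38% per annum

print(f"Final value: {final_value:,.0f} dollars")
print(f"Compound annual growth rate: {cagr:.1%}")
```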

So yes, in a broad sense, volatility carries risk. The more violent the price fluctuations, the higher is the probability that, under a variety of psychological and financial pressures – an investor may get scared and give up on his conviction, or may need to liquidate at the wrong time – he will experience a catastrophic loss. But how can such a probability be measured? The routine, automatic answer is: the standard deviation of returns. Here is the picture:

The graph on the left is the cumulative standard deviation of monthly returns from May 1997 (allowing for an initial 12-month data accumulation) to May 2017, for Amazon and for the S&P500 index. The graph on the right shows the 12-month rolling standard deviations. The cumulative graph, which uses the maximal amount of data, shows that while the monthly standard deviation of the S&P500 has been stable at around 5%, Amazon’s standard deviation has been declining steadily since an initial peak, although it still remains about four times that of the index (18.4% vs. 4.4%). The 12-month rolling version shows a similar gap, with Amazon’s standard deviation currently about three times that of the S&P500 (5.1% vs. 1.8%).
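For readers who want to reproduce this kind of picture, here is a minimal sketch of the two calculations, assuming monthly return series have already been loaded into pandas; the random numbers below are only a stand-in for the actual Amazon and S&P500 data:

```python
import numpy as np
import pandas as pd

# Stand-in for the actual monthly returns from May 1997 to May 2017;
# in practice these would be computed from the two price histories.
rng = np.random.default_rng(0)
dates = pd.date_range("1997-05-31", periods=241, freq="M")
returns = pd.DataFrame({
    "AMZN": rng.normal(0.02, 0.15, len(dates)),
    "SP500": rng.normal(0.006, 0.045, len(dates)),
}, index=dates)

# Cumulative standard deviation: all data up to each date,
# after an initial 12-month accumulation window.
cumulative_sd = returns.expanding(min_periods=12).std()

# 12-month rolling standard deviation.
rolling_sd = returns.rolling(window=12).std()

print(cumulative_sd.iloc[-1])   # latest expanding estimates
print(rolling_sd.iloc[-1])      # latest 12-month estimates
```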

What does this mean? Why is it relevant? What can such information tell us about the probability that, if we buy Amazon today, we may incur a big loss in the future? A moment’s thought gives us the answer: very little. Clearly, today’s Amazon is a completely different entity compared to its early days in the ’90s. Using any data from back then to guide today’s investment decision is nothing short of mindless. Amazon today is not four times as risky as the market, just as it wasn’t five times as risky in November 2008. Nor is it three times as risky, as implied by the 12-month rolling data. The obvious point is that the standard deviation of returns is a backward-looking, time-dependent and virtually meaningless number, which, contrary to the precision it pretends to convey, has only the vaguest relation to anything resembling what it purports to measure.

The same is true for the other CAPM-based, but still commonly used measure of risk: beta. Here is Amazon’s beta versus the S&P500 index, again cumulative and on a 12-month rolling basis:

Again, the cumulative graph shows that Amazon’s beta has always been high, though it has halved over time from 4 to 2. So is Amazon a high beta stock? Not according to the 12-month rolling measure, which today is 0.4 – Amazon is less risky than the market! – but has been all over the place in the past, from as high as 6.7 in 2007 to as low as -0.4 in 2009. Longer rolling measures give a similar picture. What does it mean? Again, very little. According to the CAPM, Amazon’s beta is supposed to be a constant or at least stable coefficient, measuring the stock’s sensitivity to general market movements. But in reality it is nothing of the kind: like the standard deviation of returns, beta is just an erratic, retrospective and ultimately insignificant number.
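The beta calculations can be sketched along the same lines – again with stand-in data, since the point is only the mechanics: beta is the ratio of the stock-market covariance to the market variance, computed here on an expanding and on a 12-month rolling window:

```python
import numpy as np
import pandas as pd

# Stand-in monthly returns again; in practice use the actual series.
rng = np.random.default_rng(1)
dates = pd.date_range("1997-05-31", periods=241, freq="M")
mkt = pd.Series(rng.normal(0.006, 0.045, len(dates)), index=dates, name="SP500")
amzn = pd.Series(1.5 * mkt.values + rng.normal(0.01, 0.12, len(dates)),
                 index=dates, name="AMZN")

def beta(stock: pd.Series, market: pd.Series) -> float:
    # Beta = covariance(stock, market) / variance(market)
    return stock.cov(market) / market.var()

# Cumulative beta: all data from the start up to each date.
cumulative_beta = pd.Series([beta(amzn[:t], mkt[:t]) for t in dates[12:]],
                            index=dates[12:])

# 12-month rolling beta.
rolling_beta = amzn.rolling(12).cov(mkt) / mkt.rolling(12).var()

print(cumulative_beta.iloc[-1], rolling_beta.iloc[-1])
```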

Volatility implies risk. But reducing risk to volatility is wrong, ill-conceived and in itself risky, as it inspires the second leg of the CAPM misconception: the positive relationship between risk and expected return. ‘Be brave, don’t worry about the rollercoaster – you’ll be fine in the end and you’ll get a premium. The more risk you are willing to bear, the higher the risk premium you will earn.’ Another moment’s reflection is hardly necessary to reveal the foolishness – and to lament the untold damage – of such a misguided line of reasoning. The operator’s common sense view is correct: once risk is properly defined as the probability of a substantial and permanent loss of capital, the more risk there is the lower – not the higher – is the probability-weighted expected return. This also requires unlearning – often, alas, the hard way.

Despite Samuelson’s best wishes, then, there is far less authentic common ground between operators and academics than what is pretended – in more or less good faith – in both camps. Operators are right: there is much more to risk than volatility and beta, and actual risk earns no premium.

So the next question becomes: what prevents academics from seeing it?

More in the next post.

May 26, 2017

In the latest chapter of his life-long and eventually triumphal effort to promote index investing, John Bogle explains what lies at the foundation of his philosophy: ‘my first-hand experience in trying but failing to select winning managers’ (p. 6). In 1966, as the new 37-year-old CEO of Wellington Management Company, Bogle decided to merge the firm with ‘a small equity fund manager that jumped on the Go-Go bandwagon of the late 1960s, only to fail miserably in the subsequent bear market. A great – but expensive – lesson’ (p. 7), which cost him his job.

It reminded me of another self-confessed failure, as recounted by Eugene Fama, who in his young days worked as a stock market forecaster for his economics professor, Harry Ernst: ‘Part of my job was to invent schemes to forecast the market. The schemes always worked on the data used to design them. But Harry was a good statistician, and he insisted on out-of-sample tests. My schemes invariably failed those tests’. (My Life in Finance, p. 3).

I can’t help seeing both incidents as instances of Festinger’s cognitive dissonance. It runs more or less like this: 1) I know a lot about economics and stock markets. 2) I am smart – truth be told, very smart. 3) I could use my brains to predict stock prices/select winning managers and make a lot of money. 4) I can’t. Therefore: it must be impossible. I think this goes a long way towards explaining the popularity and intuitive appeal of the Efficient Market Theory in academia.

The theoretical underpinnings of the EMT were set by the Master himself, Paul Samuelson, who in 1965 gave the world the Proof that Properly Anticipated Prices Fluctuate Randomly, followed in 1973 by the Proof that Properly Discounted Present Values of Assets Vibrate Randomly.

Typical academics are keen to take these as conclusive demonstrations – derived from first principles, like Euclidean theorems – of the impossibility of market beating. But the Master knew better. At the end of ‘Fluctuate’ he wrote:

I have not here discussed where the basic probability distributions are supposed to come from. In whose minds are they ex ante? Is there any ex post validation of them? Are they supposed to belong to the market as a whole? And what does that mean? Are they supposed to belong to the “representative individual”, and who is he? Are they some defensible or necessitous compromise of divergent expectation patterns? Do price quotations somehow produce a Pareto-optimal configuration of ex ante subjective probabilities? This paper has not attempted to pronounce on these interesting questions.

And at the end of ‘Vibrate’:

In summary, the present study shows (a) there is no incompatibility in principle between the so-called random-walk model and the fundamentalists’ model, and (b) there is no incompatibility in principle between behaviour of stocks’ prices that behave like random walk at the same time that there exists subsets of investors who can do systematically better than the average investors.

Then in 1974 he reiterated the point in crystal clear terms, addressed to both academics and practitioners in the first issue of the Journal of Portfolio Management:

What is at issue is not whether, as a matter of logic or brute fact, there could exist a subset of the decision makers in the market capable of doing better than the averages on a repeatable, sustainable basis. There is nothing in the mathematics of random walks or Brownian movements that (a) proves this to be impossible, or (b) postulates that it is in fact impossible. (Challenge to Judgment, p. 17, his italics).

And for the EMT zealots:

Many academic economists fall implicitly into confusion on this point. They think that the truth of the efficient market or random walk (or, more precisely, fair-martingale) hypothesis is established by logical tautology or by the same empirical certainty as the proposition that nickels sell for less than dimes.

The nearest thing to a deductive proof of a theorem suggestive of the fair-game hypothesis is that provided in my two articles on why properly anticipated speculative prices do vibrate randomly. But of course, the weasel words “properly anticipated” provide the gasoline that drives the tautology to its conclusion. (p. 19).

There goes ‘Bogle’s truth’. And the irony of it is that in his latest piece Bogle reminisces on how, as he read it at the time, ‘Dr. Samuelson’s essay … struck me like a bolt of lightning’ (p. 6). A hard, obnubilating blow indeed.

There was, nevertheless, a legitimate reason for the fulmination. Samuelson’s Challenge to Judgment was a call to practitioners:

What is interesting is the empirical fact that it is virtually impossible for academic researchers with access to the published records to identify any member of the subset with flair. This fact, though not an inevitable law, is a brute fact. The ball, as I have already noted, is in the court of those who doubt the random walk hypothesis. They can dispose of the uncomfortable brute fact in the only way that any fact is disposed of – by producing brute evidence to the contrary. (p. 19).

He was referring to Jensen (1968) and the copious subsequent literature reporting a failure to identify a consistent subset of long-term outperforming funds. What Samuelson missed, however – and what still goes largely unnoticed – is that the ‘risk adjustments’ to fund and index returns used in these studies are based on definitions of risk – as volatility, beta and the like – that presume market efficiency. To his credit, Eugene Fama has always been very clear on this point, which he calls the joint hypothesis problem:

Market efficiency can only be tested in the context of an asset pricing model that specifies equilibrium expected returns. […] As a result, market efficiency per se is not testable. […] Almost all asset pricing models assume asset markets are efficient, so tests of these models are joint tests of the models and market efficiency. Asset pricing and market efficiency are forever joined at the hip. (My Life in Finance, p. 5-6).

Typically, outperforming funds are explained away, and their returns driven to statistical insignificance, by the ‘higher risk’ they are deemed to have assumed. But such risk is defined and measured according to some version of the EMT! It is – as James Tobin wryly put it – a game where you win when you lose (see Tobin’s comment to Robert Merton’s essay in this collection).

It was precisely in defiance of this game that Warren Buffett wrote his marvellous Superinvestors piece, which sits up there next to Ben Graham’s masterwork in every intelligent investor’s reading list. As in his latest shareholder letter, Buffett used the coin-flipping story, fit for humans as well as orangutans, to point out that past outperformance can be the product of chance. But then he drew attention to an important difference:

If (a) you had taken 225 million orangutans distributed roughly as the U.S. population is; if (b) 215 winners were left after 20 days; and if (c) you found that 40 came from a particular zoo in Omaha, you would be pretty sure you were on to something. So you would probably go out and ask the zookeeper about what he’s feeding them, whether they had special exercises, what books they read, and who knows what else. That is, if you found any really extraordinary concentrations of success, you might want to see if you could identify concentrations of unusual characteristics that might be causal factors. (p. 6).

Hence he proceeded to illustrate the track record of his nine Superinvestors, stressing that it was not an ex post rationalisation of past results but a validation of superior stock picking abilities that he had pre-identified ex ante.

So let’s do a thought experiment and imagine that Buffett 2007 went back 40 years to 1967 and wagered a bet: ‘I will give 82,000 dollars (the equivalent, in 1967 money, of about 500,000 2007 dollars) to any investment pro who can select five funds that will match the performance of the S&P500 index in the next ten years’. Would Buffett 1967 have taken the bet? Sure – he would have said – in fact, I got nine! And after nine years, one year prior to the end of the bet, he would have proclaimed his victory (I haven’t done the calculation on Buffett’s Tables, but I guess it’s right). Now let’s teleport Buffett 2016 to 1976. What would he have said? Would he have endorsed those funds or recommended investing in the then newly launched Vanguard S&P index fund?

Here, then, is why I am disoriented – and I’m sure I’m not alone – by Mr. Buffett’s current stance on index investing. To be clear: 1) I am sympathetic to his aversion to Buffett impersonators promoting mediocre and expensive hedge funds. 2) I think index funds can be the right choice for certain kinds of savers. 3) I think Jack Bogle is an earnest and honourable man. However, as a grateful and impassioned admirer of Buffett 1984, I am puzzled by Buffett 2016. Like the former, the latter agrees with Paul Samuelson against ‘Bogle’s truth’: long term outperformance, while difficult and therefore uncommon – no one denies it – is possible. But while Buffett 1984 eloquently expanded on the ‘intellectual origin’ (p. 6) of such possibility, and on the ex ante characteristics of superior investors, Buffett 2016’s message is: forget about it, don’t fall for ex post performance and stick to index funds.

Notice this is not a message for the general public: it is addressed to Berkshire Hathaway’s shareholders – hardly the know-nothing savers who may be better served by basic funds. Buffett is very clear about this: buying a low-cost S&P500 index fund is his ‘regular recommendation’ (p. 24), to large and small, individual as well as professional and institutional investors – noticeably including the trustees of his family estate (2013 shareholder letter, p. 20).

Great! There goes a life-long dedication to intelligent investing. You may as well throw away your copy of Security Analysis. Alternatively, you may disagree with Mr. Buffett – nobody is perfect – and hope he reconsiders his uncharacteristically unfocused analysis. From the Master who taught us how to select good stocks one would expect equivalent wisdom on how to select good funds. It is not the same thing, but there are many similarities. As in stock picking, there are many wrong things one can do in fund picking. Past performance is no guarantee of future performance. Expensive stocks as well as expensive funds deceptively draw investors’ attention. There is no reason why large stocks or large funds should do better than small ones. Don’t go with the crowd. And so on. Similarly, just like Mr. Buffett taught us how to do the right things in stock picking, he could easily impart comparable advice in fund picking.

Here is the first one that comes to mind: look at the first ten stocks in a fund and ask the fund manager why he holds them. If he makes any reference to their index weight, run away.

May 20, 2017

Look at the top holdings of Italian Equities funds (Azionari Italia) on morningstar.it. They are the same for most of them: ENI, Intesa Sanpaolo, Enel, Unicredit, Luxottica, Assicurazioni Generali, Fiat Chrysler, and so on. Why? Do most fund managers agree that these are the best and most attractive companies quoted on the Italian stock market? No. The reason is that these are the largest companies by market capitalization, and therefore the largest components of the most commonly used Italian Equities index, the FTSE MIB. The same is true for other countries and regions, as well as for sector funds: look at the composition of the relevant index and you will work out a large portion of the funds’ holdings.

To a candid layman this looks very strange. ENI may be a good company, but why should it be as much as 10% of an Italian Equities fund? Surely, a company’s size has nothing to do with how valuable it is as an investment. Aren’t there more attractive choices? And if so, shouldn’t the fund invest in them, rather than park most of the money in the larger companies?

No, is the fund manager’s answer: the fund’s objective is not simply to find attractive investments. It is to obtain over time a better return than its peers and the index. This is what drives investors’ choices, determines the fund’s success and its manager’s reward. To beat the index – says the manager – I have to face it: take it as a neutral position and vary weights around it. So if I think that ENI is fairly valued I will hold its index weight, if I think it is undervalued I will hold more, and if I think it is overvalued I will hold less. How much more or less is up to me. But if ENI is 10% of the index I would have to regard it as grossly overvalued before deciding to hold none of it in the fund. A zero weight would be a huge bet against the index, which, if it goes wrong – ENI does well and I don’t have it – would hurt the fund’s relative performance and my career.

Sorry to insist – says the outspoken layman – but shouldn’t the fund’s performance and your career be better served if you take that 10% and invest it in stocks that you think will do better than ENI? If you do the same with the other large stocks which, like ENI, you hold in the fund just because they are in the index, you may be wrong a few times, but if you are any good at stock picking – and you tell me you are, that’s why I should buy your fund – then surely you are going to do much better than the index. What am I missing?

Look sir, with all due respect – says the slightly irritated manager – let me do my job. You want the fund to outperform, and so do I. So let me decide how best to achieve that goal, if you don’t mind.

I do mind – says the cheeky layman, himself showing signs of impatience. Of course I want you to beat the index. But I want you to do it with all my money, not just some of it. The index is just a measure of the overall market value. If ENI is worth 53 billion euro and the whole Italian stock market is worth 560 billion – less than Apple, by the way – then, sure, ENI is about 10% of the market. But what does that have to do with how much I, you or anybody else should own of it? The market includes all stocks – the good, the bad and the ugly. If you are able to choose the best stocks, you should comfortably do better than the market. If you can’t, I will look somewhere else.

Oh yeah? Good luck with that – the manager has given up his professional demeanour – hasn’t anybody told you that most funds do worse than the index?

Yes, I am aware of it – says the layman – that’s why I am looking for the few funds that can do better. You’re right, if your peers do what you do, I am not surprised they can’t beat the index. But I’ll keep looking. Good bye.

Well done, sir – someone else approaches the layman – let me introduce myself: I am the indexer. You’re right, all this overweight and underweight business is a complete waste of time and money. The reality is that, sooner or later, most funds underperform the index – and they even want to get paid for it! So let me tell you what I do: in my fund, I hold the stocks in the index at exactly their neutral weight, but I charge a small fraction of the other funds’ fees. This way, my fund does better than most other funds, at a much lower cost. How does that sound?

Pretty awful, I must say – says the layman – I am looking for a fund that invests all my money in good stocks and you are proposing one that does none of that and mindlessly buys index stocks. And you call yourself an investor?

Pardon me, but you’re so naïve – says the indexer – I am telling you I do better than most, at a lower cost. What part of the message don’t you understand?

Well, it’s not true – says the layman – and proceeds to show the indexer a list of funds that have done better than the relevant index and the other funds for each category over several periods after all costs – he may be a layman but he’s done his homework.

Oh, that’s rubbish – retorts the indexer – and performs his well-rehearsed coin-tossing gig. These are just the lucky guys who happen to sit on the right tail of the return distribution for a while. Sooner or later, their performance will revert to the mean. And do you know why? Because markets are efficient. Have you heard of the Efficient Market Theory? – he asks with a smug look. There is tons of academic evidence that proves that consistent market beating is impossible.

Yes, I know the EMT – says the layman – and I think it is wrong. Beating the market is clearly difficult – if it were easy everybody could do it, hence nobody would – but it is not impossible. The numbers I just showed you prove my point, and to dismiss them as a fluke is a miserable argument, fit only for haughty academics in need of a soothing answer to a most nagging question: If you’re so smart, why aren’t you rich? Tell me something – continues the layman – what drives market efficiency? Certainly not you, or the other gentleman with his marginal tweaking. You buy any company in the index regardless of price.

Yes – says the indexer, hiding his discomfort – but we are powerful and responsible shareholders and make sure that our voice gets heard.

Give me a break – the layman laughs – companies don’t care about you. They know you have to hold their shares no matter what. You’re the epitome of an empty threat. You don’t even know or care what these companies do. You are not an investor – you’re a free rider.

Ok then – says the indexer (he knew his was a phony argument but he tried it anyway) – what’s wrong with that? If there are enough active investors busy driving prices to where they should be, my passive fund reaps the benefits, my investors pay less and everyone is happy.

You should be ashamed of yourself, you know – says the layman, ready to end his second conversation.

Aw come on now! – blurts the indexer – who’s worse: me, transparently declaring what I do and charging little for it, or the other guy, pretending to be smart, doing worse than me and charging ten times as much?

You’ve got a point there – says the layman – you’re better than him. But you’re not going to get my money either. Good bye.

As you like, it’s your money – says the indexer, before launching his departing salvo: you know, even Warren Buffett says that index investing is the smart thing to do.

I have seen that – says the layman – what was he thinking?

Yes, what was Warren Buffett thinking when in his 2016 shareholder letter he proposed (p. 24) to erect a statue to John Bogle? Let’s see.

Back in the 2005 letter, Buffett prognosticated that active managers would, in aggregate, underperform the US stock market. He was reiterating the ‘fundamental truth’ of index investing. In the latest words of its inventor and proselytiser:

Before intermediation costs are deducted, the returns earned by equity investors as a group precisely equal the returns of the stock market itself. After costs, therefore, investors earn lower-than-market returns. (p. 2)

In its most general sense, this is an obvious tautology: the aggregate return equals the market return by definition. However, ‘Bogle’s truth’ is usually intended to apply as well to mutual funds, which for US equities represent about 20% of the aggregate (see e.g. Exhibit 3 here). As such, there is no logical reason why mutual funds should necessarily perform like the market as a group, and worse than the market after costs. In fact, a layman would be justified in expecting professional investors to do better, before and after costs, compared to e.g. households. Whether mutual funds do better than the market is therefore an empirical rather than a logical matter.
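A toy example, with made-up numbers, may help separate the tautology from the empirical claim: the value-weighted average return across all investor groups is the market return by construction, but nothing forces any particular subset – mutual funds included – to sit exactly on that average:

```python
# Toy illustration with made-up numbers: the value-weighted average return of
# all investor groups is the market return by definition, but whether a given
# subset (e.g. mutual funds, ~20% of the aggregate) beats it is an empirical matter.
market_share = {"mutual funds": 0.20, "households": 0.50, "other": 0.30}   # hypothetical weights
gross_return = {"mutual funds": 0.09, "households": 0.06, "other": 0.07}   # hypothetical returns

market_return = sum(market_share[g] * gross_return[g] for g in market_share)
print(f"Aggregate (market) return: {market_return:.2%}")    # the tautology: a weighted average

fees = 0.01   # hypothetical cost drag for the mutual fund subset
print(f"Mutual funds vs market, before costs: {gross_return['mutual funds'] - market_return:+.2%}")
print(f"Mutual funds vs market, after costs:  {gross_return['mutual funds'] - fees - market_return:+.2%}")
```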

The question has a long history, dating back to Jensen (1968) all the way to the latest S&P SPIVA report. Most of these studies make it particularly hard for outperformance to show up. Rather than squarely comparing fund returns to the market index, they either adjust performance for ‘risk’ (Jensen) using the now abandoned CAPM model, or slice and dice fund returns (SPIVA), box them into a variety of categories and compare them to artificial sub-indices. As a result, the commonly held view – reflected in Buffett’s 2005 prediction – is that ‘most funds underperform the market’. From this, the allure of index investing is a small logical step and a seemingly impregnable conclusion. All you need to say is, as Buffett puts it (p. 24):

There are, of course, some skilled individuals who are highly likely to out-perform the S&P over long stretches. In my lifetime, though, I’ve identified – early on – only ten or so professionals that I expected would accomplish this feat.

There are no doubt many hundreds of people – perhaps thousands – whom I have never met and whose abilities would equal those of the people I’ve identified. The job, after all, is not impossible. The problem simply is that the great majority of managers who attempt to over-perform will fail. The probability is also very high that the person soliciting your funds will not be the exception who does well.

Further complicating the quest for worthy managers – says Buffett – is the fact that outperformance may well be the result of luck over short periods, and that it typically attracts a torrent of money, which the manager gladly accepts to his own benefit, thus making future returns more difficult to sustain.

The bottom line: When trillions of dollars are managed by Wall Streeters charging high fees, it will usually be the managers who reap outsized profits, not the clients. Both large and small investors should stick with low-cost index funds.

It was on this basis that Buffett followed his 2005 prophecy by offering a bet to any investment professional able to select at least five hedge funds that would match the performance of a Vanguard S&P500 index fund over the subsequent ten years. He called for hedge funds, which represent an even smaller portion of the US equity investor universe, as he considers them the most strident example of divergence between bold return promises – reflected in hefty fees – and actual results. Most hedge funds do not set beating the S&P500 as their stated objective, preferring instead to target high returns independent of market conditions. But Buffett’s call was right: what’s the point of charging high fees if you can’t deliver more than index returns? At the same time, presumably he would not have objected to betting against long-only active funds explicitly managed to achieve S&P500 outperformance.

What followed – said Buffett – was the sound of silence. This is indeed surprising. Hedge fund managers’ objectives may be fuzzier, but if you manage a long-only US equity fund with a mandate to outperform the S&P500 and you genuinely believe you can do it, what better promotional opportunity is there than to bet against Warren Buffett and win?

Be that as it may, only one manager took up the challenge. And – bless him – he did not choose five long-only funds, nor five hedge funds, but five funds of hedge funds: he picked five funds that picked more than 100 hedge funds that picked thousands of stocks. Nothing wrong with that, in principle. Presumably, each of the five funds of funds managers believed he could select a portfolio of hedge funds that, at least on average, would do so much better than the S&P500 that, despite the double fee layer, it would itself end up well ahead of the index. They were wrong, very wrong (p. 22). Over the nine years from 2008 to 2016, the S&P500 returned 85.4% (7.1% per annum). Only fund of funds C got somewhat close, with a return of 62.8% (5.6% per annum). The other four funds returned, in order: 28.3%, 8.7%, 7.5% and 2.9% (that is 2.8%, 0.9%, 0.8% and 0.3% per annum).
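The compounding arithmetic behind those per-annum figures is easy to verify; only funds C and D are matched to specific numbers in the text, so the remaining three are listed simply in the order quoted:

```python
# A quick check of the compounding arithmetic: over the nine years 2008-2016,
# the per-annum return is (1 + cumulative) ** (1/9) - 1.
cumulative = {
    "S&P500 index fund": 0.854,        # 7.1% per annum
    "Fund of funds C": 0.628,          # 5.6% per annum
    "Fund of funds D": 0.029,          # the 0.3% per annum mentioned further down
    "Other fund (1st quoted)": 0.283,  # the remaining funds, in the order quoted
    "Other fund (2nd quoted)": 0.087,
    "Other fund (3rd quoted)": 0.075,
}

for name, total in cumulative.items():
    annualised = (1 + total) ** (1 / 9) - 1
    print(f"{name:24s} {total:7.1%} cumulative -> {annualised:5.1%} per annum")
```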

Result: Buffett’s valiant and solitary challenger, Mr. Ted Seides, co-manager, at the time, of Protégé Partners, played a very bad hand and made a fool of himself. But Buffett was lucky: he set out to prove ‘Bogle’s truth’ and observe index-like returns before fees, turning into underperformance after fees, but what he got was abysmal returns. Except perhaps for fund C, the gaping hole between the funds and the S&P500 had very little to do with fees. Buffett estimated that about 60% of all gains achieved by the five funds of funds went into the two fee layers. But even if fund D, returning a whopping 0.3% per year, had charged nothing, to select hedge funds that charged nothing, it would still have ended up well below the index. Same for funds A and E and, likely, for fund B.

To recap: when applied to mutual and hedge funds, ‘Bogle’s truth’ is not a logical necessity – as it is often portrayed to be – but an empirical statement. Performance studies make it hard for outperformance to emerge, but beating the index in the long run is certainly no easy task, even for professional investors. Fees make it even harder – the higher the fees, the harder the task. However, while difficult to achieve and therefore rare to observe, long-term outperformance is not impossible – Buffett is the first to acknowledge it: he is living proof!

Why is it then that he interpreted his bet win against Seides as evidence of ‘Bogle’s truth’? Imagine he had called for five value stocks and got five duds. Would he have interpreted this as evidence of the impossibility of value investing? What’s the difference between picking stocks and picking funds? Why does Buffett consider the former a difficult but valiant endeavour while the latter an impossible waste of time?

More in the next post.

Apr 30, 2017

One of the beauties of maths is that it is the same in every language. So you don’t need to know Italian to read the table on the second page of this article in this week’s Milano Finanza.

The Made in Italy Fund started in May last year and is up 43% since then.

Here are the main points of the article:

  1. The Italian stock market is dominated by the largest 40 stocks included in the FTSE MIB index. The FTSE MIB and the FTSE Italia All Shares indices are virtually overlapping (first graph on page 1).
  2. 2/3 of the Italian market is concentrated in 4 sectors.
  3. Small Caps – companies with a market cap of less than 1 billion euro – are 3/4 of the 320 quoted names, but represent only 6% of the value of the market.
  4. Small Caps as a whole have underperformed Large Caps (second graph).
  5. But quality Small Caps – those included in the Star segment of the market – have outclassed the MIB index (third graph).
  6. However, the Star index is itself concentrated (table on page 3): the top 11 stocks in the index with a market cap above 1 billion (not 12: Yoox is no longer there) represent more than 60% of the index value (a company needs to be below 1 billion to get into the Star segment, but it is not necessarily taken out when it goes above).
  7. Therefore, to invest in Italian Small Caps you need to know what you’re doing: you can’t just buy a Mid/Small Cap ETF – which is what a lot of people did in the first quarter of this year, after the launch of PIR accounts (similar to UK ISAs), taking the Lyxor FTSE Italia Mid Cap ETF from 42 to 469 million.

To this I would add: you can’t just buy a fund tracking the Star index either (there are a couple): to own a stock just because it is part of an index makes no sense – more on this in the next post.

Mar 19, 2017

Fisher’s Bias – focusing on a low FPR without regard to TPR – is the mirror image of the Confirmation Bias – focusing on a high TPR without regard to FPR. They both neglect the fact that what matters is the ratio of the two – the Likelihood Ratio. As a result, they both give rise to major inferential pitfalls.
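In the notation used throughout this blog, the two biases can be put side by side with a couple of hypothetical numbers – a minimal sketch, not a statement about any particular study:

```python
# The two biases in the blog's notation, with hypothetical numbers.
# TPR = P(E|H), FPR = P(E|not H), LR = TPR/FPR,
# BR = prior probability of H, PP = posterior probability of H given E.
def posterior(br, tpr, fpr):
    prior_odds = br / (1 - br)
    posterior_odds = prior_odds * tpr / fpr    # Bayes' theorem in odds form
    return posterior_odds / (1 + posterior_odds)

# Confirmation Bias: impressed by a high TPR, ignoring an equally high FPR.
# LR is close to 1, so the evidence barely moves the prior.
print(posterior(br=0.5, tpr=0.90, fpr=0.85))   # ~0.51

# Fisher's Bias: impressed by a low FPR, ignoring TPR and the prior.
# 'Significant' at the 5% level, yet the hypothesis remains improbable.
print(posterior(br=0.01, tpr=0.80, fpr=0.05))  # ~0.14
```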

The Confirmation Bias explains weird beliefs – the ancient Greeks’ reliance on divination and the Aztecs’ gruesome propitiation rites, as well as present-day lunacies, like psychics and other fake experts, superstitions, conspiracy theories and suicide bombers, alternative medicine and why people drink liquor made by soaking a dried tiger penis, with testicles attached, in a bottle of French cognac.

Fisher’s Bias has no less deleterious consequences. FPR<5% hence PP>95%: ‘We have tested our theory and found it significant at the 5% level. Therefore, there is only a 5% probability that we are wrong.’ This is the source of a deep and far-reaching misunderstanding of the role, scope and goals of what we call science.

‘Science says that…’, ‘Scientific evidence shows that…’, ‘It has been scientifically proven that…’: the view behind these common expressions is of science as a repository of established certainties. Science is seen as the means for the discovery of conclusive evidence or, equivalently, the accumulation of overwhelmingly confirmative evidence that leaves ‘no room for doubt or opposition‘. This is a treacherous misconception. While truth is its ultimate goal, science is not the preserve of certainty but quite the opposite: it is the realm of uncertainty, and its ethos is to be entirely comfortable with it.

Fisher’s Bias sparks and propagates the misconception. Evidence can lead to certainty, but it often doesn’t: the tug of war between confirmative and disconfirmative evidence does not always have a winner. By equating ‘significance’ with ‘certainty beyond reasonable doubt’, Fisher’s Bias encourages a naïve trust in the power of science and a credulous attitude towards any claim that manages to be portrayed as ‘scientific’. In addition, once deflated by the reality of scientific controversy, such trust can turn into its opposite: a sceptical view of science as a confusing and unreliable enterprise, propounding similarly ‘significant’ but contrasting claims, all portrayed as highly probable, but in fact – as John Ioannidis crudely puts it – mostly false.

Was Ronald Fisher subject to Fisher’s Bias? Apparently not: he stressed that ‘the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis’, immediately adding that ‘if an experiment can disprove the hypothesis’ it does not mean that it is ‘able to prove the opposite hypothesis.’ (The Design of Experiments, p. 16). However, the reasoning behind such conclusion is typically awkward. The opposite hypothesis (in our words, the hypothesis of interest) cannot be tested because it is ‘inexact’ – remember in the tea-tasting experiment the hypothesis is that the lady has some unspecified level of discerning ability. But – says Fisher – even if we were to make it exact, e.g. by testing perfect ability, ‘it is easy to see that this hypothesis could be disproved by a single failure, but could never be proved by any finite amount of experimentation’ (ibid.). Notice the confusion: saying that FPR<5% disproves the null hypothesis but FPR>5% does not prove it, Fisher is using the word ‘prove’ in two different ways. By ‘disproving’ the null he means considering it unlikely enough, but not certainly false. By ‘proving’ it, however, he does not mean considering it likely enough – which would be the correct symmetrical meaning – but he means considering it certainly true. That’s why he says that the null hypothesis as well as the opposite hypothesis are never proved. But this is plainly wrong and misleading. Prove/disprove is the same as accept/reject: it is a binary decision – doing one means not doing the other. So disproving the null hypothesis does mean proving the opposite hypothesis – not in the sense that it is certainly true, but in the correct sense that it is likely enough.

Here then is Fisher’s mistake. If H is the hypothesis of interest and not H the null hypothesis, FPR=P(E|not H) – the probability of the evidence (e.g. a perfect choice in the tea-tasting experiment) given that the hypothesis of interest is false (i.e. the lady has no ability and her perfect choice is a chance event). Then saying that a low FPR disproves the null hypothesis is the same as saying that a low P(E|not H) means a low P(not H|E). But since P(not H|E)=1–P(H|E)=1–PP, then a low FPR means a high PP, as in: FPR<5% hence PP>95%.

Hence yes: Ronald Fisher was subject to Fisher’s Bias. Despite his guarded and ambiguous wording, he did implicitly believe that 5% significance means accepting the hypothesis of interest. We have seen why: prior indifference. Fisher would not contemplate any value of BR other than 50%, i.e. BO=1, hence PO=LR=TPR/FPR. Starting with prior indifference, all that is needed for PP=1-FPR is error symmetry.
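A few lines of code make the point concrete – a minimal sketch of the odds-form calculation, with illustrative numbers:

```python
# Verifying the point above: starting from prior indifference (BR = 50%, BO = 1),
# PP equals 1 - FPR only under error symmetry (FNR = FPR, i.e. TPR = 1 - FPR).
def pp(br, tpr, fpr):
    bo = br / (1 - br)      # prior odds
    po = bo * tpr / fpr     # posterior odds = BO * LR
    return po / (1 + po)

fpr = 0.05
print(pp(0.50, 1 - fpr, fpr))  # error symmetry: TPR = 95% -> PP = 95% = 1 - FPR
print(pp(0.50, 0.20, fpr))     # asymmetric errors: TPR = 20% -> PP = 80%, not 95%
print(pp(0.10, 1 - fpr, fpr))  # no prior indifference: BR = 10% -> PP ~ 68%
```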

Fisher’s Bias gives rise to invalid inferences, misplaced expectations and wrong attitudes. By setting FPR in its proper context, our Power surface brings much needed clarity on the subject, including, as we have seen, Ioannidis’s brash claim. Let’s now take a closer look at it.

Remember Ioannidis’s main point: published research findings are skewed towards acceptance of the hypothesis of interest based on the 5% significance criterion. Fisher’s bias favours the publication of ‘significant’ yet unlikely research findings, while ‘insignificant’ results remain unpublished. As we have seen, however, this happens for a good reason: it is unrealistic to expect a balance, as neither researchers nor editors are interested in publishing rejections of unlikely hypotheses. What makes a research finding interesting is not whether it is true or false, but whether it confirms an unlikely hypothesis or disconfirms a likely one.

Take for instance Table 4 in Ioannidis’s paper (p. 0700), which shows nine examples of research claims as combinations of TPR, BO and PP, given FPR=5%. Remember the match between our and Ioannidis’s notation: FPR=α, TPR=1-β (FNR=β), BO=R and PP=PPV. For the moment, let’s just take the first two columns and leave the rest aside:

So for example the first claim has TPR=80%, hence LR=16 and, under prior indifference (BO=1, BR=50%), PO=16 and therefore PP=94.1%. In the second, we have TPR=95%, hence LR=19, BO=2 and BR=2/3, hence PO=38 and therefore PP=97.4%. And so on. As we can see, four claims have PP>50%: there is at least a preponderance of evidence that they are true. Indeed the first three claims are true even under a higher standard, with claim 2 in particular reaching beyond reasonable doubt, as it starts from an already high prior, which gets further increased by powerful confirmative evidence. In 3, powerful evidence manages to update a sceptical 25% prior to an 84% posterior, and in 6 to update an even more strongly sceptical prior to a posterior above 50%. The other five claims, on the other hand, have PP<50%: they are false even under the lowest standard of proof, with 8 and 9 in particular standing out as extremely unlikely. Notice however that in all nine cases we have LR>1: evidence is, in various degrees, confirmative, i.e. it increases prior odds to a higher level. Even in the last two cases, where evidence is not very powerful and BR is a tiny 1/1000 – just like in our child footballer story – LR=4 quadruples it to 1/250. The posterior is still very small – the claims remain very unlikely – but this is the crucial point: they are a bit less unlikely than before. That’s what makes a research finding interesting: not a high PP but a LR significantly different from 1. All nine claims in the table – true and false – are interesting and, as such, worth publication. This includes claim 2, where further confirmative evidence brings virtual certainty to an already strong consensus. But notice that in this case disconfirmative evidence, reducing prior odds and casting doubt on such consensus, would have attracted even more interest. Just as we should expect to see a preponderance of studies confirming unlikely hypotheses, we should expect to see the same imbalance in favour of studies disconfirming likely hypotheses. It is the scientific enterprise at work.
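The figures quoted above for the first three claims can be reproduced directly from the odds form of Bayes’ theorem, using only the values explicitly given in the text:

```python
# Reproducing the figures quoted above for the first three claims in
# Ioannidis's Table 4 (FPR = 5% throughout).
def pp_from_odds(bo, tpr, fpr):
    po = bo * tpr / fpr     # posterior odds = prior odds * likelihood ratio
    return po / (1 + po)

print(pp_from_odds(bo=1,   tpr=0.80, fpr=0.05))  # claim 1: LR = 16, PP = 94.1%
print(pp_from_odds(bo=2,   tpr=0.95, fpr=0.05))  # claim 2: LR = 19, PO = 38, PP = 97.4%
print(pp_from_odds(bo=1/3, tpr=0.80, fpr=0.05))  # claim 3: BR = 25%, PP = 84.2%
```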

Let’s now look at Ioannidis’s auxiliary point: the preponderance of ‘significant’ findings is reinforced by a portion of studies where significance is obtained through data manipulation. He defines bias u as ‘the proportion of probed analyses that would not have been “research findings”, but nevertheless end up presented and reported as such, because of bias’ (p. 0700).

How does Ioannidis’s bias modify his main point? This is shown in the following table, where PP* coincides with PPV in his Table 4:

Priors are the same, but now bias u causes a substantial reduction in LR and therefore in PP. For instance, in the first case u=0.10 means that 10% of research findings supporting the claim have been doctored into 5% significance through some form of data tampering. As a result, LR is lowered from 16 to 5.7 and PP from 94.1% to 85%. So in this case the effect of bias is noticeable but not determinant. The same is true in the second case, where a stronger bias causes a big reduction in LR from 19 to 2.9, but again not enough to meaningfully alter the resulting PP. In the third case, however, an even stronger bias does the trick: it reduces LR from 16 to 2 and PP from 84.2% all the way down to 40.6%. While the real PP is below 50%, a 40% bias makes it appear well above: the claim looks true but is in fact false. Same for 6, while the other five claims, which would be false even without bias, are even more so with bias – their LR reduced to near 1 and their PP consequently remaining close to their low BR.

This sounds a bit confusing, so let’s restate it, taking case 3 as an example. The claim starts with a 25% prior – it is not a well established claim and would therefore benefit from some confirmative evidence. The reported evidence looks quite strong: FPR=5% and TPR=80%, giving LR=16, which elevates PP to 84.2%. But in reality the evidence is not as strong: 40% of the findings accepting the claim have been squeezed into 5% significance through data fiddling. Therefore the real LR – the one that would have emerged without data alterations – is much lower, and so is the real PP resulting from it: the claim appears true but is false. The same goes for claim 6, bringing the total of false claims from five to seven – indeed most of them.

How does bias u alter LR? In Ioannidis’s model, it does so mainly by turning FPR into FPR*=FPR+(1-FPR)u – see Table 2 in the paper (p. 0697). FPR* is a positive linear function of u, with intercept FPR and slope 1-FPR, which, since FPR=5%, is a very steep 0.95. In case 3, for example, u produces a large increase of FPR from 5% to 43%. In addition, u turns TPR into TPR*=TPR+(1-TPR)u, which is also a positive linear function of u, with intercept TPR and slope 1-TPR, which, since the TPR of confirmative evidence is higher than FPR, is flatter. In case 3 the slope is 0.2, so u increases TPR from 80% to 88%. The combined effect, as we have seen, is a much lower LR*=TPR*/FPR*, going down from 16 to 2.
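Here is the same adjustment as a short sketch, using only the two formulas quoted above from Ioannidis’s Table 2; the case 3 numbers serve as a check.

```python
def biased_rates(tpr, fpr, u):
    """Ioannidis's bias adjustment: returns (TPR*, FPR*, LR*)."""
    fpr_star = fpr + (1 - fpr) * u      # FPR* = FPR + (1 - FPR) * u
    tpr_star = tpr + (1 - tpr) * u      # TPR* = TPR + (1 - TPR) * u
    return tpr_star, fpr_star, tpr_star / fpr_star

tpr_s, fpr_s, lr_s = biased_rates(0.80, 0.05, 0.40)       # case 3, u = 40%
print(round(fpr_s, 2), round(tpr_s, 2), round(lr_s, 2))   # 0.43 0.88 2.05
```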

I will post a separate note about this model, but the point here is that, while Ioannidis’s bias increases the proportion of false claims, it is not the main reason why most of them are false. Five of the nine claims in his Table 4 would be false even without bias.

In summary, by confusing significance with virtual certainty, Fisher’s Bias encourages Ioannidis’s bias (I write it with a small b because it has no cognitive value: it is just more or less intentional cheating). But Ioannidis’s bias does not explain ‘Why Most Research Findings Are False’. The main reason is that many of them test unlikely hypotheses, and therefore, unless they manage to present extraordinary or conclusive evidence, their PP turns out to be lower, and often much lower, than 50%. But this does not make them worthless or unreliable, contrary to what the paper’s title obliquely suggests. As long as they are not cheating, researchers are doing their job: trying to confirm unlikely hypotheses. At the same time, however, they have another important responsibility: to warn the reader against Fisher’s Bias, by explicitly clarifying that, no matter how ‘significant’ and impressive their results may appear, they are not ‘scientific revelations’ but tentative discoveries in need of further evidence.

Feb 272017
 

Armed with our Power surface, let’s revisit John Ioannidis’s claim according to which ‘Most Published Research Findings Are False’.

Ioannidis’s target is the immeasurable confusion generated by the widespread mistake of interpreting Fisher’s 5% statistical significance as implying a high probability that the hypothesis of interest is true. FPR<5%, hence PP>95%. As we have seen, this is far from being the general case: TPR=PP if and only if BO=FPR/FNR, which under prior indifference requires error symmetry.
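For the record, here is the one-step derivation of that equivalence, written out in the notation used so far (PP obtained from Bayes’ theorem):

```latex
\[
PP=\frac{TPR\cdot BR}{TPR\cdot BR+FPR\,(1-BR)}=TPR
\;\Longleftrightarrow\;
FPR\,(1-BR)=(1-TPR)\,BR=FNR\cdot BR
\;\Longleftrightarrow\;
BO=\frac{BR}{1-BR}=\frac{FPR}{FNR}
\]
```

Under prior indifference BO=1, so the condition collapses to FPR=FNR, i.e. error symmetry.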

Fisher’s 5% significance is neither a sufficient nor a necessary condition for accepting the hypothesis of interest. Besides FPR, acceptance and rejection depend on BR, PP and TPR. Given FPR=5%, all combinations of the three variables lying on or above the curved surface indicate acceptance. But, contrary to Fisher’s criterion, combinations below the surface indicate rejection. The same is true for values of FPR below 5%, which are even more ‘significant’ according to Fisher’s criterion. These widen the curved surface and shrink the roof, thus enlarging the scope for acceptance, but may still indicate rejection for low priors and high standards of proof, if power (TPR) is not, or cannot be, high enough. On the other hand, FPR values above 5%, which according to Fisher’s criterion are ‘not significant’ and therefore imply unqualified rejection, reduce the curved surface and expand the roof, thus enlarging the scope for rejection, but may still indicate acceptance for higher priors and lower standards of proof, provided power (TPR) is high enough. Here are pictures for FPR=2.5% and 10%:

So let’s go back to where we started. We want to know whether a certain claim is true or false. But now, rather than seeing it from the perspective of a statistician who wants to test the claim, let’s see it from the perspective of a layman who wants to know if the claim has been tested and whether the evidence has converged towards a consensus, one way or the other.

For example: ‘Is it possible to tell whether milk has been poured before or after hot water by just tasting a cup of tea?’ (bear with me please). We google the question and let’s imagine we get ten papers delving into this vital issue. The earliest is by none other than the illustrious Ronald Fisher, who ran the experiment on the algologist Muriel Bristol and, on finding that she made no mistakes with 8 cups – an event that has only a 1/70 probability of being the product of chance, i.e. a p-value of 1.4%, much lower than required by his significance criterion – concluded, against his initial scepticism, that ‘Yes, it is possible’. That’s it? Well, no. The second paper describes an identical test performed on the very same Ms Bristol three months later, where she made 1 mistake – an event that has a 17/70 probability of being a chance event, i.e. a p-value of 24.3%, much larger than the 5% limit allowed by Fisher’s criterion. Hence the author rejected Professor Fisher’s earlier claim about the lady’s tea-tasting ability. On to the third paper, where Ms Bristol was given 12 cups and made 1 mistake, an event with a 37/924=4% probability of being random, once again below Fisher’s significance criterion. And so on with the other papers, each one with its own set-up, its own numbers and its own conclusions.
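These p-values are easy to check. Here is a minimal sketch of Fisher’s exact test for the tea-tasting design, assuming the cups are always split equally between milk-first and tea-first (the second and later experiments are, of course, imaginary):

```python
from math import comb

def tea_p_value(cups, correct):
    """Probability of identifying at least `correct` of the milk-first cups
    by pure chance, with cups split equally between the two preparations."""
    m = cups // 2                       # milk-first cups (= tea-first cups)
    total = comb(cups, m)               # ways of picking m cups out of all
    # exactly k right: C(m, k) ways to pick the right ones, C(m, m-k) the wrong ones
    tail = sum(comb(m, k) * comb(m, m - k) for k in range(correct, m + 1))
    return tail / total

print(tea_p_value(8, 4))     # 1/70   ≈ 1.4%:  no mistakes with 8 cups
print(tea_p_value(8, 3))     # 17/70  ≈ 24.3%: one mistake with 8 cups
print(tea_p_value(12, 5))    # 37/924 ≈ 4.0%:  one mistake with 12 cups
```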

It is tempting at this point for the layman to throw up his hands in despair and execrate the so-called experts for being unable to give a uniform answer to such a simple question. But he would be entirely wrong. The evidential tug of war between confirmative and disconfirmative evidence is the very essence of science. It is up to us to update our prior beliefs through the multiplicative accumulation of evidence, and to accept or reject a claim according to our standard of proof.

If anything, the problem is the opposite: not too much disagreement but – and this is Ioannidis’s main point – too little. Evidence accumulation presupposes that we are able to collect the full spectrum, or at least a rich, unbiased sample of all the available evidence. But we hardly ever do. The evidence we see is what reaches publication, and publication is naturally skewed towards ‘significant’ findings. A researcher who is trying to prove a point will only seek publication if he thinks he has gathered enough evidence to support it. Who wants to publish a paper about a theory only to announce that he has got inadequate evidence for it? And even if he tried, what academic journal would publish it?

As a result, available evidence tends to be biased towards acceptance. And since acceptance is still widely based on Fisher’s criterion, most published papers present FPR<5%, while those with FPR>5% remain unpublished and unavailable. To add insult to injury, in order to reach publication some studies get squeezed into significance through more or less malevolent data manipulation. This is what Ioannidis calls bias: ‘the combination of various design, data, analysis, and presentation factors that tend to produce research findings when they should not be produced’ (p. 0697).

This dramatically alters the evidential tug of war. It is as if, when looking into the milk-tea question, we found only Fisher’s paper and others accepting the lady’s ability – including some conveniently glossing over a mistake or two – and none on the side of rejection. We would then be inclined to conclude that the experts agree, and would be tempted to go along with them – perhaps disseminating and reinforcing the bias through our own devices.

How big a problem is this? Ioannidis clearly thinks it is huge – hence the dramatic title of his paper, enough to despair not just about some experts but about the entire academic community and its scientific enterprise. Is it this bad? Are we really swimming in a sea of false claims?

Let’s take a better look. First, we need to specify what we mean by true and false. As we know, it depends on the standard of proof, which in turn depends on utility preferences. What is the required standard of proof for accepting a research finding as true? Going by the wrongful interpretation of Fisher’s 5% significance criterion, it is PP>95%. But this is not only a mistake: it is the premise behind an insidious misrepresentation of the very ethos of scientific research. Obviously, truth and certainty are the ultimate goals of any claim. But the value of a research finding is not in how close it is to the truth, but in how much closer it gets us to the truth. In our framework, it is not in PP but in LR.

The goal of scientific research is finding and presenting evidence that confirms or disconfirms a specific hypothesis. How much more (or less) likely is the evidence if the hypothesis is true than if it is false? The value of evidence is in its distance from the unconfirmative middle point LR=1. A study is informative, hence worth publication, if the evidence it presents has a Likelihood Ratio significantly different from 1, and is therefore a valuable factor in the multiplicative accumulation of knowledge. But a high (or low) LR is not the same as a high (or low) PP. They only coincide under prior indifference, where, more precisely, PO=LR, i.e. PP=LR/(1+LR). So, for example, if LR=19 – the evidence is 19 times more likely if the hypothesis is true than if it is false – then PP=19/20=95%. But, as we know very well, prior indifference is not a given: it is a starting assumption which, depending on the circumstances, may or may not be valid.

BO – Ioannidis calls it R, ‘the ratio of “true relationships” to “no relationships”’ (p. 0696) – gives the pre-study odds of the investigated hypothesis being true. It can be high, if the hypothesis has been tested before and is already regarded as likely true, or low, if it is a novel hypothesis that has never been tested and, if true, would be an unexpected discovery. In the first case, LR>1 is a further confirmation that the hypothesis should be accepted as true – a useful but hardly noteworthy exercise that just reinforces what is already known. On the other hand, LR<1 is much more interesting, as it runs against the established consensus. LR could be low enough to convert a high BO into a low PO, thus rejecting a previously accepted hypothesis. But not necessarily: while lower than BO, PO could remain high, thus keeping the hypothesis true, while casting some doubt on it and prompting further investigation. In the second case, LR<1 is a further confirmation that the hypothesis should be rejected as false. On the other hand, LR>1 increases the probability of an unlikely hypothesis. It could be high enough to convert a low BO into a high PO, thus accepting what was previously an unfounded conjecture. But not necessarily: while higher than BO, PO could remain low, thus keeping the hypothesis false, but at the same time stimulating more research.

Such distinctions get lost in Ioannidis’s sweeping claim. True, SUTCing priors and neglecting a low BO can lead to mistakenly accepting hypotheses on the basis of evidence that, while confirmative, leaves PO well below any acceptance level. The mistake is exacerbated by Fisher’s Bias – confusing significance (a low FPR) with confirmation (a high PP) – and by Ioannidis’s bias – squeezing FPR below 5% through data alteration. FPR<5% does not mean PP>95%, or even PP>50%. As shown in our Power surface, for any standard of proof, the lower BO is, the higher the required TPR at any level of FPR. Starting from a low BO, accepting the hypothesis requires very powerful evidence. Without it, acceptance is a false claim. Moreover, published acceptances – rightful or wrongful – are not adequately counterbalanced by rejections, which remain largely unpublished. This however occurs for an entirely legitimate reason: there is little interest in rejecting an already unlikely hypothesis. Interesting research is what runs counter to the prior consensus. Starting from a low BO, any confirmative evidence is interesting, even when it is not powerful enough to turn a low BO into a high PO. Making an unlikely hypothesis a bit less unlikely is interesting enough, and is worth publication. But – here Ioannidis is right – it should not be confused with acceptance. Likewise, there is little interest in confirmative evidence when BO is already high. What is interesting in this case is disconfirmative evidence, again even when it is not powerful enough to reject the hypothesis by turning a high BO into a low PO. Making a likely hypothesis a bit less likely is interesting enough. But it should not be confused with rejection.

Feb 172017
 

Back to PO=LR∙BO.

Whether we accept or reject a hypothesis, i.e. decide whether a claim is true or false, depends on all three elements.

Posterior Odds. The minimum standard of proof required to accept a hypothesis is PO>1 (i.e. PP>50%). We call it Preponderance of evidence. But, depending on the circumstances, this may not be enough. We have seen two other cases: Clear and convincing evidence: PO>3 (i.e. PP>75%), and Evidence beyond reasonable doubt: PO>19 (PP>95%), to which we can add Evidence beyond the shadow of a doubt: PO>99 (PP>99%) or even PO>999 (PP>99.9%). The spectrum is continuous, from 1 to infinity, where Certainty (PP=100%) is unattainable and is therefore a decision. The same is symmetrically true for rejecting the hypothesis, from PO<1 to the other side of Certainty: PO=PP=0.

Base Odds. To reach the required odds we have to start somewhere. A common starting point is Prior indifference, or Perfect ignorance: BO=1 (BR=50%). But, depending on the circumstances, this may not be a good starting place. With BO=1 it looks like Base Odds have disappeared, but they haven’t: they are just being ignored – which is never a good start. Like PO, Base Odds are on a continuous spectrum between the two boundaries of Faith: BR=100% and BR=BO=0. Depending on BO, we need more or less evidence in order to achieve our required PO.

Likelihood Ratio. Evidence is confirmative if LR>1, i.e. TPR>FPR, and disconfirmative if LR<1, i.e. TPR<FPR. The sizes of TPR and FPR are not relevant per se – what matters is their ratio. A high TPR means nothing without a correspondingly low FPR. Ignoring this leads to the Confirmation Bias. Likewise, a low FPR means nothing without a correspondingly high TPR. Ignoring this leads to Fisher’s Bias.

To test a hypothesis, we start with a BO level that best reflects our priors and set our required standard of proof PO. The ratio of PO to BO determines the required LR: the strength or weight of the evidence we demand in order to accept the hypothesis. In our tea-tasting story, for example, we have BO=1 (BR=50%) and PO>19 (PP>95%), giving LR>19: in order to accept the hypothesis that the lady has some tea-tasting ability, we require evidence that is at least 19 times more likely if the hypothesis is true than if it is false. A test is designed to calculate FPR: the probability that the evidence is a product of chance. This requires defining a random variable and assigning to it a probability distribution. Our example is an instance of what is known as Fisher’s exact test, where the random variable is the number of successes in a given number of trials without replacement, as described by the hypergeometric distribution. Remember that with 8 trials the probability of a perfect choice under the null hypothesis of no ability is 1/70, while the probability of 3 successes and 1 failure is 16/70, giving a p-value (the probability of doing at least as well by chance) of 17/70, and so on. Hence, in case of a perfect choice we accept the hypothesis that the lady has some ability if TPR>19∙(1/70)=27% – a very reasonable requirement. But with 3 successes and 1 failure we would require an impossible TPR>19∙(17/70)=461%. On the other hand, if we lower our required PO to 3 (PP>75%), then all we need is TPR>3∙(17/70)=73% – a high but feasible requirement. But if we lower our BO to a more sceptical level, e.g. BO=1/3 (BR=25%), then TPR>3∙3∙(17/70)=219% is again too high, whereas a perfect choice may still be acceptable evidence of some ability, even with the higher PO: TPR>3∙19∙(1/70)=81%.
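As a sketch, the whole acceptance rule boils down to one line: the required power is the required Likelihood Ratio (PO/BO) times FPR. The numbers below are the ones just worked out; anything above 1 is unattainable.

```python
def required_tpr(po, bo, fpr):
    """Minimum power needed to accept: TPR must exceed (PO / BO) * FPR."""
    return (po / bo) * fpr

print(required_tpr(19, 1,   1/70))     # ≈0.27: perfect choice, PO>19, BO=1
print(required_tpr(19, 1,   17/70))    # ≈4.61: one mistake, impossible
print(required_tpr(3,  1,   17/70))    # ≈0.73: one mistake, lower PO>3
print(required_tpr(3,  1/3, 17/70))    # ≈2.19: sceptical BO=1/3, impossible
print(required_tpr(19, 1/3, 1/70))     # ≈0.81: perfect choice, sceptical BO
```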

So there are four variables: PP, BR, FPR and TPR. Of these, PP is set by our required standard of proof, BR by our prior beliefs and FPR by the probability distribution of the relevant random variable. These three combined give us the minimum level of TPR required to accept the hypothesis of interest. TPR – the probability of the evidence in case the hypothesis is true – is also known as statistical power, or sensitivity. Our question is: given our starting priors and required standard of proof, and given the probability that the evidence is a chance event, how powerful should the evidence be for us to decide that it is not a chance event but a sign that the hypothesis of interest is true?

Clearly, the lower FPR is, the more inclined we are to accept. As we know, Fisher would do it if FPR<5% – awkwardly preferring to declare himself able to disprove the null hypothesis at such a significance level. That was enough for him: he apparently took no notice of the other three variables. But, as we have seen, what he might have been doing was implicitly assuming error symmetry and prior indifference and setting PP beyond reasonable doubt, thus requiring TPR>95%, i.e. LR>19. Or, more likely, at least in the tea test, he was starting from a more sceptical prior (e.g. 25%), while at the same time lowering his standard of proof to e.g. 75%, which at FPR=5% requires TPR>45%, i.e. LR>9, or perhaps to 85%, which requires TPR>85%, i.e. LR>17.

There are many combinations of the four variables that are consistent with the acceptance of the hypothesis of interest. To see it graphically, let’s fix FPR: imagine we have just run a test and calculated that the probability that the resulting evidence is the product of chance is 5%. Do we accept the hypothesis? Yes, says Fisher. But we say: it depends on our priors and standard of proof. Here is the picture:

For each BR, the required TPR is an increasing, convex function of PP. For example, with prior indifference (BR=50%) and a minimum standard of proof (PP>50%) all we need to accept the hypothesis is TPR>5% (i.e. LR>1): the hypothesis is more likely to be true than false. But with a higher standard, e.g. PP>75%, we require TPR>15% (LR>3), and with PP>95% we need TPR>95% (LR>19). The requirement gets steeper with a sceptical prior. For instance, halving BR to 25% we need TPR>15% for a minimum standard and TPR>45% for PP>75%. But PP>95% would require TPR>1: no evidence at this FPR is powerful enough for us to accept the hypothesis beyond reasonable doubt. For each BR, the maximum standard of proof that keeps TPR below 1 is BO/(BO+FPR). Under prior indifference, that is 95% (95.24% to be precise: PO=20), but with BR=25% it is 87%. The flat roof area in the figure indicates the combinations of priors and standards of proof that are incompatible with accepting the hypothesis at the 5% FPR level.
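Here is a small sketch of that ‘roof’ boundary: with TPR capped at 1, the highest attainable standard of proof at a given FPR is BO/(BO+FPR).

```python
def max_pp(bo, fpr):
    """Highest standard of proof reachable at a given FPR:
    with TPR = 1, PO = BO / FPR, hence PP = BO / (BO + FPR)."""
    return bo / (bo + fpr)

print(max_pp(1,   0.05))    # ≈0.9524: prior indifference, FPR = 5%
print(max_pp(1/3, 0.05))    # ≈0.87:   sceptical 25% prior (BO = 1/3)
```

Plugging in FPR=1% or FPR=10% reproduces the 97%, 91% and 77% limits discussed further down.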

If TPR is on or above the curved surface, we accept the hypothesis. But, unlike Fisher, if it is below we reject it: despite a 5% FPR, the evidence is not powerful enough for our priors and standard of proof. Remember that we don’t need to calculate TPR precisely. If, as in the tea-tasting story, the hypothesis of interest is vague – the lady has some unspecified ability – it might not be possible. But what we can do is assess whether TPR is above or below the required level. If we are prior indifferent and want to be certain beyond reasonable doubt that the hypothesis is true, we need TPR>95%. But if we are happy with a lower 75% standard then all we need is TPR>15%. If on the other hand we have a sceptical 25% prior, there is no way we can be certain beyond reasonable doubt, while with a 75% standard we require TPR>45%.

It makes no sense to talk about significance, acceptance and rejection without first specifying priors and standards of proof. In particular, very low priors and very high standards land us on the flat roof corner of the surface, making it impossible for us to accept the hypothesis. This may be just fine – there are many hypotheses that I am too sceptical and too demanding to be able to accept. At the same time, however, I want to keep an open mind. But doing so does not mean renouncing my scepticism or compromising my standards of proof. It means looking for more compelling evidence with a lower FPR. That’s what we did when the lady made one mistake with 8 cups. We extended the trial to 12 cups and, under prior indifference and with no additional mistakes, we accepted her ability beyond reasonable doubt. Whereas, starting with sceptical priors, acceptance required lowering the standard of proof to 75% or extending the trial to 14 cups.

To reduce the size of the roof we need evidence that is less likely to be a chance event. For instance, FPR=1% shrinks the roof to a small corner, where even a sceptical 25% prior allows acceptance up to PP=97%. In the limit, as FPR tends to zero – there is no chance that the evidence is a random event – we have to accept. Think of the lady gulping 100 cups in a row and spotlessly sorting them into the two camps: even the most sceptical statistician would regard this, beyond the shadow of a doubt, as conclusive positive evidence (a Smoking Gun). On the other hand, coarser evidence, more easily compatible with a chance event, enlarges the roof, thus making acceptance harder. With FPR=10%, for instance, the maximum standard of proof under prior indifference is 91%, meaning that even the most powerful evidence (TPR=1) is not enough for us to accept the hypothesis beyond reasonable doubt. And with a sceptical 25% prior the limit is 77%, barely above the ‘clear and convincing’ level. While harder, however, acceptance is not ruled out, as it would be under Fisher’s 5% criterion. According to Fisher, a 1 in 10 probability that the evidence is the product of chance is just too high for comfort, making him unable to reject the null hypothesis. But what if TPR is very high, say 90%? In that case, LR=9: the evidence is 9 times more likely if the hypothesis is true than if it is false and, under prior indifference, PP=90%. Sure, we can’t be certain beyond reasonable doubt that the hypothesis is true, but in many circumstances it would be eminently reasonable to accept it.

Jan 212017
 

Sorry we’re very late, but anyway, HAPPY NEW YEAR.

One reason is that once again the video was blocked on a copyright issue. Sony Music, owner of Uptown Funk, had no problems: their policy is to monetise the song. But Universal Music Group, owner of Here Comes the Sun, just blocks all Beatles songs – crazy. I am disputing their claim – it is just the first 56 seconds and we’re singing along, for funk sake! They will give me a response by 17 February.

Talking about Funk, Bruno Mars says the word 29 times. Lorenzo assures me there is an n in each one of them, but I have my doubts. (By the way, check out Musixmatch, a great Italian company).

What does Funk mean anyway? It’s another great kaleidoscopic expression.

I am putting the video online against my brother’s advice – “It’s bellissimo for the family, but what do others care?” I say fuggettaboutit, it’s funk.

Update. I got the response: “Your dispute wasn’t approved. The claimant has reviewed their claim and has confirmed it was valid. You may be able to appeal this decision, but if the claimant disagrees with your appeal, you could end up with a strike on your account.” Silly.

So I put the video on Vimeo. No problem there. Crazy.

Dec 262016
 

This may be obvious to some of you, so I apologise. But it wasn’t to me, and since it has been such a great discovery I’d like to offer it as my Boxing Day present.

If you are my age, or even a bit younger, chances are that you have a decent/good/awesome hifi stereo system, with big, booming speakers and all the rest of it. Mine is almost 20 years old, and it works great. But until recently I’d been using it just to listen to my CDs. All the iTunes and iPhone stuff – including dozens of CDs that I had imported into it – was, in my mind, separate: I would only listen to it through the PC speakers. Good, but not great. I wished there was a way to play it through the hifi. Of course, I could have bought better PC speakers, or a new whiz-bang fully-connected stereo system. But I couldn’t be bothered – I just wanted to use my old one.

The PC and the hifi are in different rooms, so connecting them doesn’t work, even wirelessly. But after some thinking and research I found a perfect solution. So here it is.

All I needed was this:

It’s a tiny 5×5 cm Bluetooth Audio Adapter, made by Logitech. You can buy it on the Logitech website or on Amazon, currently at £26. All you need to do is connect it to the stereo and pair your iPhone or iPad (or equivalent) to it.

To complete the setup, I also bought this:

It is a 12×10 cm 1-Output 2-Port Audio Switch, also available on Amazon, currently at £7. Here is the back:

So you connect:

  1. The Audio Adapter to the first Input of the switch (the Adapter comes with its own RCA cable).
  2. The Output of your system’s CD player to the second Input of the switch (you need another RCA cable).
  3. The CD Input of your system’s amplifier to the Output of the switch (you need a third RCA cable).

And you’re done! Now if you want to listen to your iPhone stuff you press button 1 on the front of the switch. If you want to listen to your CDs you press button 2.

And of course – but this you should already know – you can also listen to the tons of music on display on any of the streaming apps: Amazon Music, Google Play Music, Groove, Spotify, Deezer etc.

There you go. You’ve just saved hundreds or thousands of your favourite currency that an upgrade would have cost you. And you’re using your good old system to the full.

Merry Christmas.
