A brief history of medical statistics and its impact on reproducibility.

You have to admit that medical research is a strange bird. There are few industries where we take a group of highly trained, specialized practitioners who already shoulder a great deal of responsibility, and then not only ask them to contribute to research, but also demand that they lead it. Unfortunately, this odd situation has grim consequences for medical research, negatively impacting prospects for patients and the public’s health. I’ll come back to this point shortly, but first, my favorite joke of all time:

Q: What’s the difference between agricultural and medical research?

A: The former isn’t conducted by farmers (credit to Guernsey McPearson)

Feel free to let that sink in.

If you don’t see the “humor” in this, then you might avert your eyes as I share the quote it’s based on, from Michael Healy’s Fisher Memorial Lecture (1995) on the life of Frank Yates (1):

One thing to be taken into account is that Yates was throughout his peace-time career an agricultural scientist. To some it may seem that agricultural science is something of an oxymoron, but this is far from the truth. Bawden, my then Director, asked me when I left Rothamsted ‘why was I giving up a decent area agriculture for a scientific backwater like medicine?’ (as a virologist, he may have been prejudiced). In fact, as I learned in due course, clinical research is a largely amateur pursuit done by doctors, while agricultural research is done by professional scientists, not farmers.

Before anyone gets agitated, this quote refers to medical research in the mid-twentieth century, not all that long after the publication of the MRC’s famous trial of streptomycin (2). The modern practice of medical research was in its infancy, to say the least.

Fast forward to today, and we see that medical research is now highly professionalized. The public and private money spent globally on medical research is substantial (about $40 billion a year from the NIH in the US alone), and the benefits of research for both future and current patient outcomes are widely appreciated. There is greater regulation of clinical research, a more developed research support infrastructure, and many more research training opportunities for clinicians. However, there is an important aspect of clinical research that remains underdeveloped, perhaps even amateurish, which is how we deploy statistical expertise.

A bit of history

At some point the clinical research community broadly accepted that any reasonably bright person with a bit of training could “do the stats.” Below we’ll discuss the consequences of this, but I want to think a bit about how we got there in the first place. (To be clear, this is a crude approximation, at best, of how this might have played out. It’s influenced by things I’ve happened to read over time, not by any directed research on my part. If you find the history of statistics interesting, I can recommend Stephen Senn’s Dicing with Death (3), and pretty much anything you can find by Stephen Stigler.)

In the beginning there was probability theory, which is a set of mathematical rules for manipulating a special kind of number where zero equals impossible, and one equals absolute certainty. And while the theorems that govern these probabilities were derived by multiple people, coming from multiple perspectives, they all wound up in the same place.

One use of probability theory is to solve clever problems and brain teasers like the birthday or Monty Hall problems. These are problems where you are given a scenario and expected to work out the result. Solving this kind of problem has been referred to as an application of direct probability, and can be denoted as P(data|ϴ), which is the probability of my data (the result I can see) given ϴ (theta), where ϴ is one or more parameters of a probability distribution (a candidate explanation for how the data were generated). For example, given a fair coin for which the probability of heads is 0.5 (this is ϴ), what is the probability of getting 8 heads if you flip it 10 times?
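To make the coin example concrete, here is a minimal sketch of that direct probability calculation. It assumes the flips are independent, so that the number of heads follows a binomial distribution; the function name binomial_pmf is mine, chosen for illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, theta: float) -> float:
    """P(exactly k heads in n independent flips | probability of heads = theta)."""
    # comb(n, k) counts the orderings in which k heads can appear among n flips.
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


# Direct probability: theta is given (a fair coin), and we ask about the data.
prob = binomial_pmf(8, 10, 0.5)
print(f"P(8 heads in 10 flips | theta = 0.5) = {prob:.4f}")
```

With a fair coin this comes to 45/1024, or a little over 4%.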

There is a more important application of probability theory, however, which is the opposite problem: you are given the result, and asked to make inferences about the possible scenario(s) that led there. In other words, you want to solve P(ϴ|data). For example, you’ve flipped a coin 10 times and observed 8 heads, and you want to make some statement about whether the coin is fair, or estimate what the “true” probability of heads is. This is usually what we are trying to do in medical research, and is more challenging, since it is an epistemological question, not just a mathematical one.

This problem was originally a matter of inverse probability, and was the purview of the early Bayesians, who used Bayes’ theorem to turn likelihood functions into P(ϴ|data) by multiplying the former by a prior probability distribution of the possible ϴ. Lots of experts have written excellent papers outlining the Bayesian perspective for novices, so I won’t linger here long. The important thing to understand for our purposes is that prior to the computer age, you had to be a serious mathematician to do a proper Bayesian calculation, and even then they were largely limited to certain combinations of likelihoods and prior distributions (see conjugacy). My point here is that you couldn’t even fake it by shoving your data into SPSS and pushing random buttons to get “a result”. Without a computer, faking a Bayesian calculation would have been the equivalent of writing random symbols on a piece of paper.
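As a rough illustration of what conjugacy buys you, here is a sketch of the Bayesian version of the coin problem (8 heads in 10 flips). The uniform Beta(1, 1) prior is my assumption for the example, not something the history above prescribes. With a Beta prior and a binomial likelihood, the posterior is again a Beta distribution, so the whole "calculation" reduces to adding counts, which is exactly why this combination was tractable by hand.

```python
def beta_binomial_update(a_prior: float, b_prior: float,
                         heads: int, tails: int) -> tuple[float, float]:
    """Posterior Beta(a, b) parameters after observing coin flips.

    Conjugacy: Beta(a, b) prior x binomial likelihood -> Beta(a + heads, b + tails).
    """
    return a_prior + heads, b_prior + tails


# Assumed prior: Beta(1, 1), i.e. every value of theta is equally plausible.
a_post, b_post = beta_binomial_update(1.0, 1.0, heads=8, tails=2)

# The mean of a Beta(a, b) distribution is a / (a + b).
posterior_mean = a_post / (a_post + b_post)
print(f"Posterior: Beta({a_post:g}, {b_post:g}), mean = {posterior_mean:.2f}")
```

Under that assumed prior the posterior is Beta(9, 3), with a mean of 0.75. A different prior would give a different answer, which is part of what makes the inverse problem an epistemological one.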

Things started changing in the early 20th century though. In a nutshell, the frequentists arrived on the scene, turned the problem on its ear, started working out the sampling distributions for useful test statistics, and made the case for their use in inference and decision making (e.g. p-values, confidence intervals, and error control). And while it took very serious mathematicians to work those sampling distributions out, once they were done, they could be printed as probability tables in big books that mere mortals such as you or I could pull off the shelf and compare our observed data against to see how they fared in light of the now famous “null hypothesis.” As more of these sampling distributions were worked out, for more and more estimators, under more and more sets of assumptions about how the data might have been generated, they were given names (like a paired t-test, or a chi-squared test) and packaged together into books, the most famous of which was RA Fisher’s Statistical Methods for Research Workers, first published in 1925 (14 editions, the last in 1970).
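For comparison, here is a sketch of how the frequentist approach handles the same coin (8 heads in 10 flips). It is an exact two-sided binomial test against the null hypothesis that the coin is fair, which is one reasonable choice among several and is not taken from any particular text. The sampling distribution is computed directly rather than looked up in a printed table, and the function names are mine.

```python
from math import comb


def binomial_pmf(k: int, n: int, theta: float) -> float:
    """P(exactly k heads in n independent flips | probability of heads = theta)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def two_sided_p_value(k: int, n: int, theta_null: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the null probabilities of every outcome that is no more probable
    than the observed one (a small tolerance guards against rounding ties).
    """
    observed = binomial_pmf(k, n, theta_null)
    return sum(
        p
        for p in (binomial_pmf(i, n, theta_null) for i in range(n + 1))
        if p <= observed * (1 + 1e-9)
    )


p_value = two_sided_p_value(8, 10)
print(f"Two-sided p-value for 8 heads in 10 flips: {p_value:.4f}")
```

Here the outcomes at least as extreme as the one observed are 0, 1, 2, 8, 9 and 10 heads, giving a p-value of 112/1024, or about 0.11. Note that this is a statement about the data under an assumed ϴ, not P(ϴ|data).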

Now I am not exactly sure when it became normal for clinical researchers to do their own statistical tests, though undoubtedly the development of frequentist statistics, epitomized by Fisher’s famous text, played an important role in bringing that about. But in 1929, just a few years after Fisher’s famous book, H. L. Dunn reviewed 200 research papers in medicine and physiology and concluded that 90% should have used statistical methods but didn’t (4,5). Just three years after that, Major Greenwood was pointing out that “medical papers now frequently contain statistical analyses, and sometimes these analyses are correct, but the writers violate quite as often as before, the fundamental principles of statistical or of general logical reasoning” (6). Similarly, Bradford Hill (a former student of Greenwood’s) published Principles of Medical Statistics in 1937, in which he noted (7,8):

The worker in medical problems, in the field of clinical as well as preventive medicine, must himself know something of statistical technique, both in experimental arrangements and in the interpretation of figures. To enable him to acquire some knowledge of this technique I have tried to set down as simply as possible the statistical methods that experience has shown me to be most helpful in the problems with which medical workers are concerned

So over a fairly short period of time, statistical analyses went from being viewed as impractical and unnecessary in medical research, to being critically important. Statisticians were similarly transformed from “triflers” into “patentees for more or less powerful magic” (6). And who was there to meet the demand for these now necessary statistical tests and analyses? Remember, the “new” statistics of frequentism and modern experimental design had only just been developed, and professional statisticians were rare enough to still be viewed as magic-users. Thus “doing the stats” would largely have to fall to the clinical researchers themselves. Fast forward two decades, and Donald Mainland (9) was worried about “enthusiast amateurs” applying statistical tests (10,11), and would go on to publish a long series called “Statistical ward rounds” in Clinical Pharmacology and Therapeutics to try to improve the now common state of poor statistical practice in medical research (12).

So how did it all work out? Were the early medical statisticians like Greenwood, Hill and Mainland able to stem the tide of shoddy statistical analyses with books and journal articles? Were we able to turn clinical researchers with little formal training in statistics (13), bright people who were already shouldering the enormous responsibility for providing care, into their own statisticians? Here is Doug Altman on the matter in 1994 (14):

What should we think about a doctor who uses the wrong treatment, either wilfully or through ignorance, or who uses the right treatment wrongly (such as by giving the wrong dose of a drug)? Most people would agree that such behaviour was unprofessional, arguably unethical, and certainly unacceptable.

What, then, should we think about researchers who use the wrong techniques (either wilfully or in ignorance), use the right techniques wrongly, misinterpret their results, report their results selectively, cite the literature selectively, and draw unjustified conclusions? We should be appalled. Yet numerous studies of the medical literature, in both general and specialist journals, have shown that all of the above phenomena are common. This is surely a scandal.

A resounding no. By failing to address the structurally driven deficits in scientific and statistical thinking already apparent decades earlier, and then amplifying them with professional incentives to produce more and more research (grossly mismeasured by papers published), medical research had become a scandal. Altman concluded that “We need less research, better research, and research done for the right reasons.” This is, in my opinion, the most important sentence ever written about medical research, but it fell on deaf ears. How do I know this? Two decades later, the Lancet published a special issue on research waste, which made the case that medical research costing billions of dollars each year is wasted due to things like poor research questions, flawed study designs, erroneous statistical analyses, and clumsy research reports (15).

Just a few years ago, the editor of that same Lancet seemed to summarize these problems as science taking “a turn towards darkness” (16). He had just attended a symposium on the reproducibility (17) and reliability of biomedical research, a topic that tends to be presented as a recent or modern concern. But it isn’t. Here is Horton’s full quote:

The case against science is straightforward: much of the scientific literature, perhaps half, may simply be untrue. Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses, and flagrant conflicts of interest, together with an obsession for pursuing fashionable trends of dubious importance, science has taken a turn towards darkness.

As you can see, it’s all about misunderstood statistical concepts, misapplied statistical methods, or, perhaps most important of all, a lack of statistical thinking, the value of which, more than anything else, is to stop us from fooling ourselves and each other. There was never a turn towards darkness. This is the way it’s always been in medical research. The question is when we will decide to turn towards the light.

To be continued…


1. Healy, M. J. R. Frank Yates, 1902–1994 — The Work of a Statistician*. Int. Stat. Rev. 63, 271–288 (1995).

2. Crofton, J. The MRC randomized trial of streptomycin and its legacy: a view from the clinical front line. J. R. Soc. Med. 99, 531–534 (2006).

3. Senn, S. Dicing with death: chance, risk, and health. (Cambridge University Press, 2003).

4. Dunn, H. L. APPLICATION OF STATISTICAL METHODS IN PHYSIOLOGY. Physiol. Rev. 9, 275–398 (1929).

5. Mainland, D. The rise of experimental statistics and the problems of a medical statistician. Yale J. Biol. Med. 27, 1–10 (1954).

6. Greenwood, M. WHAT IS WRONG WITH THE MEDICAL CURRICULUM? The Lancet 219, 1269–1270 (1932).

7. Hill, A. B. Principles of medical statistics. (Lancet, 1937).

8. Farewell, V. & Johnson, A. The origins of Austin Bradford Hill’s classic textbook of medical statistics. JLL Bulletin: Commentaries on the history of treatment evaluation. (2011).

9. Altman, D. Donald Mainland: anatomist, educator, thinker, medical statistician, trialist, rheumatologist. J. R. Soc. Med. 113, 28–38 (2020).

10. Mainland, D. The use and misuse of statistics in medical publications. Clin. Pharmacol. Ther. 1, 411–422 (1960).

11. Cromie, B. W. THE FEET OF CLAY OF THE DOUBLE-BLIND TRIAL. The Lancet 282, 994–997 (1963).

12. Mainland, D. Statistical ward rounds — 1. Clin. Pharmacol. Ther. 8, 139–146 (1967).

13. Windish, D. M., Huot, S. J. & Green, M. L. Medicine Residents’ Understanding of the Biostatistics and Results in the Medical Literature. JAMA 298, 1010 (2007).

14. Altman, D. G. The scandal of poor medical research. BMJ 308, 283–284 (1994).

15. Ioannidis, J. P. A. et al. Increasing value and reducing waste in research design, conduct, and analysis. The Lancet 383, 166–175 (2014).

16. Horton, R. Offline: What is medicine’s 5 sigma? The Lancet 385, 1380 (2015).

17. Munafò, M. R. et al. A manifesto for reproducible science. Nat. Hum. Behav. 1, 0021 (2017).