How bad statistical practices drive away the good ones

I am interested in research integrity and reproducibility. I believe that a lack of statistical expertise throughout the sciences is a substantial driver of problems in these areas (poor data practices being another). I feel especially strongly about this thesis as it applies to medical research.

I recently wrote something amateurish about the history of medical statistics to try and understand when and why this expertise was drained out of medical research. Instead I found (I think) that there was never a sufficient amount of statistical expertise in medical research, since the frequentist revolution in statistics happened so quickly that medical research as a field was never able to catch up. Clinical researchers were suddenly being asked to include statistical analyses of study data in their research reports, but the job of “applied-statistician” didn’t actually exist yet. So we spent much of the twentieth century convincing ourselves that expert statistical support wasn’t that important anyway, a fiction that was buoyed by our tendency to incentivize quantity over quality in science, and here we are.

Now I want to start exploring some of the consequences of this deficit. The first problem I want to discuss causes personal frustrations in my day-to-day job, so apologies for any perceived whining (and please do keep your tiny violin safe in its case). But it’s a problem you should find important too (at least if you are interested in conducting high quality medical research that benefits patients and the public) because I am going to try and make the case that poor statistical practices don’t simply displace good practices, but actually create substantial barriers to doing things the right way, even when you have the good sense and opportunity to do so.

High quality clinical trials demand healthy collaborations between clinical investigators and statisticians, as each role brings unique, vital expertise to bear on the study’s design. The clinical investigator is the subject-matter expert, contributing their expertise on the clinical problem at hand. The statistician’s role, on the other hand, is ultimately about how to draw appropriate inferences from the results of a given trial (in light of other evidence), and how to design trials that best support these efforts. And while there is no doubt that good clinical investigators will acquire some of the statistician’s expertise, just as any respectable statistician will eventually become quite knowledgeable about the clinical areas they work in, I think most researchers would agree in principle on the importance of both roles.

However, in practice, it seems to be widely accepted that the expertise of the statistician can largely be replaced by the efforts of the investigator. This idea starts with their training. On the one hand, I was never taught anything that made me feel remotely qualified to practice medicine. This is despite being pre-med in undergrad (where comparative vertebrate anatomy kindly nudged me in another direction), two post-grad degrees from top schools of public health, and more than a bit of experience breaking the news of a poor prognosis. Combined with my medical research experience, I surely know more about medicine than most of the other humans, but again, I have precisely zero illusions about my ability, or lack thereof, to practice medicine. The very notion of it is obscene.

The flip side of this is that just about every clinician will have received some statistical training. Some of the clinicians who go on to more research intensive careers will have had a bit more. However, the statistical training that clinicians receive, even for those who do PhDs, is often quite limited. For most, it is appropriately aimed at helping them read and critique published literature. For those who receive additional training, it typically focuses on a relatively simple tool kit of statistical tests to be applied situationally; it often lacks foundational training in the different philosophies underlying statistical inference; and it ignores common challenges of dealing with data “in the wild” such as multiple outcomes, measurement error, or missing data. However, the largest limitation of this training is that it often neglects to let the student know just how incomplete it is. Many clinical researchers are even taught that you only need a “real statistician” for the most unusual of problems.

This tendency to view statisticians as a kind of luxury, rather than a necessity, is then reinforced by their rarity in many clinical research environments, especially those outside of the top research hospitals and well-funded clinical trials. This means that clinical investigators must often “do their own statistics”, or perhaps enlist a more junior colleague to do so on their behalf. Many of these researchers are uncomfortable with this, to their credit, but they are often given no other options. So being the intelligent, motivated people that they tend to be, many clinical investigators will do this “successfully”, particularly if success is measured in published papers and posters presented.

One might hope that our current system of publication would then catch any statistical errors and fallacious thinking in the review process, but this is a childish fantasy. Statistical review at medical journals is uneven at best, and often performed, if at all, by non-experts. Thus statistical errors and clumsy practices are routinely published in the medical literature. These then get picked up by other researchers and repeated, and some even become accepted practice. This of course leads us to Brandolini’s law — the amount of energy needed to refute bullshit is an order of magnitude bigger than to produce it.

A good example of this is the concept of post-hoc power, an idea which seems to have some appeal among researchers (based on the number of times it is invoked in scientific publications), but that is almost universally derided by statisticians. This division was recently epitomized by the publication of two reviews (1,2), both by the same team of surgeons, who framed their discussion of underpowered surgical trials around the concept of post-hoc power.

Since others have already written so much on this topic, I will be brief here. Post-hoc power is the idea that we can learn something useful by taking the observed effect from a given study, and then asking what that study’s power was to detect that effect. It is trivial to show that such a power calculation is a monotonic transformation of the p-value associated with that observed effect. In other words, nothing new can be learned from this particular power calculation. If you find a large p-value for the effect you observed, then you weren’t well powered to detect it. It’s nothing more complicated than that. Following from this, if you reviewed a number of studies, but restricted your review to those where the effects had a p-value greater than 0.05, you would certainly get a distorted view of how many published studies were well-powered to detect minimally important effects. But this is exactly what the two reviews noted above did.
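To make the monotonic relationship concrete, here is a minimal Python sketch. It assumes a two-sided z-test at a significance level of 0.05 and treats the observed effect as if it were the true effect, which is exactly what post-hoc power does. The function name `observed_power` is made up for illustration, and the p-values in the loop are arbitrary.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def observed_power(p_value: float, alpha: float = 0.05) -> float:
    """Post-hoc power of a two-sided z-test, computed from the p-value alone.

    Assumption: the true effect equals the observed effect, so the
    standardized effect is the |z| statistic implied by the p-value.
    """
    z_obs = STD_NORMAL.inv_cdf(1 - p_value / 2)   # |z| implied by the p-value
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)    # critical value, about 1.96
    # Probability that a new z statistic lands in either rejection region.
    return STD_NORMAL.cdf(z_obs - z_crit) + STD_NORMAL.cdf(-z_obs - z_crit)


# Arbitrary p-values: power falls steadily as the p-value rises.
for p in (0.001, 0.01, 0.05, 0.20, 0.50, 0.90):
    print(f"p = {p:<5}  observed power = {observed_power(p):.3f}")
```

Under these assumptions, a p-value of exactly 0.05 maps to an observed power of about 50%, and every larger p-value maps to something lower. Nothing about the study enters the calculation except the p-value and the chosen alpha, so the "power" reported this way cannot tell you anything the p-value didn’t.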

A number of experienced statisticians wrote responses to these reviews to point this out, some even calling for retraction, given that the entire rationale of the papers is based on this fatally flawed thinking. However, the journals where the reviews were published seemingly view this as a matter of debate: that there are two sides, that both deserve to be heard, and that the reader is better off for hearing their respective arguments. And voilà, Brandolini’s Law. No amount of letters from, or consensus among, the statisticians can refute this nonsense. We can talk until we are blue in the face. And others will find the post-hoc power papers and cite them when it supports their purpose, which I can pretty much promise will often be to justify their own pointless, poorly designed studies. Of course if the tables were turned, and I were publishing papers in statistics journals about my misguided opinions on surgical techniques…I can’t actually imagine what that would lead to, it’s so ludicrous. But here we are.

So why am I belly-aching about this? Am I just upset that a bunch of knuckleheaded surgeons were able to publish some nonsense (twice) against the sage advice of the wise, noble statisticians? No, I’m afraid it actually goes beyond this.

This fetishizing of debate is often justified with reference to some Bayesian flavored notion of how humans learn. The idea is that you start with your prior beliefs, and then, upon confronting data that conflicts with your previous state of knowledge, you adjust your beliefs to resolve the discrepancy. This is of course analogous to the Bayesian perspective on statistical inference, whereby a prior belief in some proposition can be represented by a probability distribution, which is then multiplied by the likelihood of some observed data, and then normalized to result in a posterior probability distribution.
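For readers who prefer to see the mechanics, the "multiply by the likelihood, then normalize" step can be sketched in a few lines of Python. This is a toy grid approximation with made-up numbers: the candidate values in `theta`, the flat prior, and the data (7 successes in 10 tries) are all assumptions chosen purely for illustration.

```python
from math import comb


def grid_posterior(prior, likelihood):
    """Multiply prior by likelihood pointwise, then normalize to sum to 1."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical setup: five candidate values for a success probability,
# each given equal prior weight.
theta = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [0.2] * len(theta)

# Made-up data: 7 successes in 10 independent tries (binomial likelihood).
successes, n = 7, 10
likelihood = [
    comb(n, successes) * t**successes * (1 - t) ** (n - successes)
    for t in theta
]

posterior = grid_posterior(prior, likelihood)
for t, p in zip(theta, posterior):
    print(f"theta = {t:.1f}  posterior = {p:.3f}")
```

With a flat prior the posterior simply follows the likelihood, so most of the weight ends up on the candidate value closest to the observed proportion. A different prior would shift that weight, which is the whole point.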

If human beings really do reason this way (I’m not so sure, but I’m no expert), then it would seemingly suggest that as two people with differing initial opinions are exposed to the same new data, their positions on the topic at hand should become more similar, and eventually converge given enough data. However, we also know that people are more than capable of holding strong opinions even in the face of strong evidence to the contrary. Does this undermine this notion that we are inherently Bayesian? Not exactly. One explanation was given by the mathematical physicist (and rabid Bayesian partisan) ET Jaynes in one of my favorite books, Probability Theory: The Logic of Science (3; there is a full pdf of this floating around online by the way). There he notes that when a person is confronted with data that conflicts with their prior, updating their prior isn’t the only option on the table. They might also downgrade how trustworthy they find the source of the data. After all, wouldn’t it be perfectly reasonable to raise an eyebrow at someone presenting data that are wildly out of line with your existing beliefs?
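A toy version of that argument can be written down in Python. To be clear, this is a simplification for illustration and not Jaynes’s own formulation. It assumes that a reliable source only asserts true claims, that an unreliable source asserts the claim half the time regardless of the truth, and that belief in the claim and trust in the source are independent beforehand. The function name and all the numbers are hypothetical.

```python
def update_on_report(prior_claim, prior_trust, p_assert_if_unreliable=0.5):
    """Update belief in a claim AND trust in a source after the source
    asserts that the claim is true.

    Returns (posterior belief in the claim, posterior trust in the source).
    """
    # Joint weight of each (claim, source) state, given the assertion was made.
    true_reliable = prior_claim * prior_trust * 1.0           # reliable source, true claim
    false_reliable = (1 - prior_claim) * prior_trust * 0.0    # reliable sources don't assert falsehoods
    true_unreliable = prior_claim * (1 - prior_trust) * p_assert_if_unreliable
    false_unreliable = (1 - prior_claim) * (1 - prior_trust) * p_assert_if_unreliable

    total = true_reliable + false_reliable + true_unreliable + false_unreliable
    post_claim = (true_reliable + true_unreliable) / total
    post_trust = (true_reliable + false_reliable) / total
    return post_claim, post_trust


# Two hypothetical listeners hear the same assertion from the same source.
for label, prior_claim in (("open-minded", 0.50), ("skeptic", 0.01)):
    claim, trust = update_on_report(prior_claim, prior_trust=0.90)
    print(f"{label:<12} belief: {prior_claim:.2f} -> {claim:.2f}   "
          f"trust: 0.90 -> {trust:.2f}")
```

In this sketch the open-minded listener ends up at 0.95 belief with trust unchanged at 0.90. The skeptic’s belief only rises to about 0.16, while their trust in the source collapses from 0.90 to about 0.15. Both are updating coherently on the same information; they just end up further apart on how much they trust the messenger.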

Returning then to this “debate” over post-hoc power, perhaps you’ll now agree that it’s not quite as simple as laying all the cards on the table and expecting everyone to converge on a now obvious truth. In fact, it’s quite the opposite. While some of those without strong opinions might learn a bit more about post-hoc power and why it doesn’t make sense, those who are already convinced of its validity won’t likely change, and might even question the expertise of the statisticians (gasp!).

So the problem isn’t simply about the time wasted refuting bullshit — it’s about degrading the trust you have in an already insufficient amount of statistical expertise (relative to needs). So when a statistician points out to a clinical colleague a statistical error that has been published by “top people in a top journal” they risk not being taken seriously. They risk not being included in the next project. They risk one investigator telling another, “I wouldn’t work with them…they don’t really seem to know what they are doing.”

These “opportunities” to lose trust are legion due to the sheer number of common statistical malpractices that seem impervious to any degree of correction. My short list of common sins in published clinical trials alone includes misinterpreted p-values, automated model selection, use of change scores, “table 1 tests” and subsequent covariate adjustment based on them, over-interpreting subgroup differences, and needless categorization. Then there are the more creative errors, like thinking that exponentiation of the coefficient from a linear regression with a log-transformed outcome results in an odds-ratio. Every time I have to have a conversation to advise a clinical colleague against one of these practices, I feel like I am fighting against hordes of established experts who have repeatedly published studies using these methods, screaming “What does he know!? I have 5 trials published in NEJM!”
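That last error is easy to check numerically. The sketch below uses made-up data and a hand-rolled least-squares slope (the helper names `ols_slope` and `geometric_mean` are mine, for illustration only). With a binary treatment indicator and a log-transformed continuous outcome, the exponentiated coefficient is the ratio of the two groups’ geometric means. There are no odds anywhere in the problem, because the outcome is not binary.

```python
from math import exp, log
from statistics import mean


def ols_slope(x, y):
    """Least-squares slope from a simple linear regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def geometric_mean(values):
    """Geometric mean, computed as exp of the mean of the logs."""
    return exp(mean([log(v) for v in values]))


# Made-up positive, continuous outcome in two arms.
control = [2.0, 4.0, 8.0]
treated = [4.0, 8.0, 16.0]

x = [0] * len(control) + [1] * len(treated)    # treatment indicator
log_y = [log(v) for v in control + treated]    # log-transformed outcome

beta = ols_slope(x, log_y)
ratio_of_geometric_means = geometric_mean(treated) / geometric_mean(control)

print(f"exp(beta)                = {exp(beta):.3f}")
print(f"ratio of geometric means = {ratio_of_geometric_means:.3f}")
```

Both printed values agree (here the treated arm’s geometric mean is twice the control arm’s). An odds ratio would require a binary outcome and, typically, a logistic regression.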

To be clear, I’ve thus far been blessed to work with researchers that by and large allow and encourage me to contribute my expertise to their research. But the few that aren’t like this are very challenging to deal with. Then you start adding in the seemingly endless flow of asinine comments about statistics from reviewers (on grants and papers), and it starts to wear you down. Every wrong-headed suggestion means that I might need to first convince my collaborators that the suggestion is baseless, and then we have to start the dance of side-stepping their bad advice while still somehow satisfying the reviewer. It’s a constant slog. It would be so much easier to just do things the wrong way, or hide in a hole writing methods papers, leaving one fewer applied-statistician, dejected and downtrodden, to “t and p” your data.

…you can take out that violin now…

1. Bababekov, Y. J. et al. Is the Power Threshold of 0.8 Applicable to Surgical Science? — Empowering the Underpowered Study. J. Surg. Res. 241, 235–239 (2019).

2. Bababekov, Y. J., Stapleton, S. M., Mueller, J. L., Fong, Z. V. & Chang, D. C. A Proposal to Mitigate the Consequences of Type 2 Error in Surgical Science. Ann. Surg. 267, 621–622 (2018).

3. Jaynes, E. T. & Bretthorst, G. L. Probability theory: the logic of science. (Cambridge University Press, 2003).