I’m not a real statistician, and you can be one too

Nov 30, 2022

Cara and I were hand-in-hand, slowly strolling around Cork on a bright, summer evening. It was the finest place on earth. I was in a suit and tie, but untucked, unbuttoned and loose. Except for my dress shoes, two sizes too small because they were the only ones I could find at the shops that morning after failing to find my own. I nervously chattered away, while she graciously listened, as we waited to find out whether I got the job.

The phone finally buzzed. I tried to answer confidently, while Cara subtly held her breath. Joe kindly let us off the hook immediately. “Congratulations, Darren.” I could hear the smile in his deep voice, but it couldn’t have been as big as mine. It was a huge promotion. Senior Lecturer in Biostatistics. We did it. Hugs. Drinks. Celebrations.

A few days later I was politely informed that no Darren, you actually can’t be a Senior Lecturer in Biostatistics. Silly goose. You are going to need to re-apply and re-interview for a job called Senior Lecturer in Patient-Focused Research Methodologies [sic]. Oh yes, it’s the exact same job. It’s just that we can’t let you call yourself a Senior Lecturer in Biostatistics because you are not, in fact, a “real” statistician.

So following my initial, profanity-tinged reaction, I rolled my eyes, swallowed my pride, and re-applied. Thankfully no super-star statisticians, real or otherwise, applied this time either, and I got the job. Again. Once the contract was signed, I declared myself Principal STATISTICIAN of the HRB Clinical Research Facility at University College Cork, and quietly vowed to take my revenge…

My quest for vengeance has of course subsided. Looking back, I still think the whole thing was silly, and I really was pretty mad about the "not a real statistician" thing. But now? Not only have I embraced my existence as a lowly applied statistician, I’m proud of it. And I want to help you become one too.

At its core, statistics is about evaluating the uncertainty of estimates and how that uncertainty impacts inferences and decision making. Viewed more broadly, statistics is about making sense of data without fooling ourselves in the process. It is fundamental to many areas of science, including medicine and health, which are the fields I work in and the main audience I am speaking to. However, despite the fundamental importance of statistics, we tend to treat it like specialist knowledge that only a select few are truly capable of wielding correctly. Of course most of us receive some training in statistics, but it’s often so shallow it does more harm than good - a little knowledge can be dangerous if it obscures your own ignorance. Thus, in practice, we wind up with lots of people knowing little about statistics, and few people, the “real” statisticians, who seem to know all of it. This is a problem.

Thankfully I think many of us eventually re-discover our own ignorance of statistics. I know I did. But in the meantime we are still heavily incentivized to produce research papers in bulk, many of which would benefit from statistical expertise. So we try to get the help we need, hopefully through collaboration or consultation with one of those “real” statisticians, but frequently fail because there aren’t nearly enough of them to go around. At least not the ones that have the time and interest to deal with us or our study. But publish we must, so we do, perhaps hoping that peer and ethical review might prevent us from making any serious errors, and finding reassurance when we see papers with stats sections and analyses that look just like our own, even in “high impact” journals. And maybe, just maybe, some of you secretly believe that what we do doesn’t actually matter for anything beyond career progression, so what’s the harm in playing along?

But the harms are real. This deficit in statistical expertise means that studies aren’t optimally, or even correctly, designed; data aren’t handled or analyzed appropriately; and results aren’t interpreted and reported as they should be. Many scientific studies thus fail to provide useful or accurate information, and in medicine this means that useless treatments are foisted onto patients, while discoveries of useful treatments are delayed. Evidence for this research waste is now overwhelming. Ignorance is no longer possible. From this point forward you can either pretend it doesn’t exist or do something about it.

Elsewhere I’ve argued, as have others, that fixing this problem will require research institutions to hire many more applied statisticians. Unlike the “real” statisticians, who tend to have PhDs in applied mathematics or statistics, and focus on expanding the boundaries of statistics, the lowly applied statistician can emerge from any number of backgrounds (nutritional epidemiology and biological anthropology in my case) and their focus is on the correct use of established statistical methods to help produce high-quality research in their field. However, while I still believe that we need many more applied statisticians in academic medicine and public health research, it’s become obvious to me that we aren’t going to get them. Not in my lifetime anyway. Thus, most of you who want to avoid research waste due to deficits in statistical expertise are simply going to have to become your own applied statistician. Congratulations, and welcome to the club.

What follows are some points of encouragement and other resources for the exciting journey ahead of you.

The most important thing to understand is that you have to play the long game. There are no shortcuts. No “learning data science in 8 easy blog posts” bullshit. You need to accept that progress will be measured in months and years, not hours and days. Trust me though, five years goes by quickly, and it’s more than long enough to turn yourself into an excellent applied statistician. So please don’t be put off by the abyss of ignorance you are currently staring into. You are going to steadily fill it up with knowledge, bit by bit. I’ve listed lots of resources to help you below, but don’t let it overwhelm you. Bit by bit. And be brave!

While you’re being so brave, let’s start with the scariest bit for many people. Math. The good news is that you don’t need to be a mathematician to be an effective applied statistician. But you do need to get comfortable working with numbers if you aren’t already.

You’ll regularly need to do simple calculations and transform data using different functions, especially logarithms. You’ll need to be able to work with probabilities. It also helps to have a general sense of what calculus and matrix algebra are used for, though you won’t need to directly use them - whew! You’ll also need to be able to stare down an equation or two, at least long enough to try and make some sense of them. But that’s about it.
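To give a sense of how modest that arithmetic usually is, here is a minimal sketch in Python (chosen only for illustration; nothing below depends on the language). The biomarker values and the 20% risk are made up, and the helper names `probability_to_odds` and `odds_to_probability` are my own, not from any library.

```python
import math

# Made-up, right-skewed biomarker values (hypothetical, for illustration only).
values = [1.2, 2.5, 4.0, 8.1, 16.3, 130.0]

# A natural-log transform pulls in the long right tail; exp() undoes it.
logged = [math.log(v) for v in values]
back_transformed = [math.exp(v) for v in logged]

# A ratio on the raw scale becomes a difference on the log scale.
ratio = values[3] / values[2]
log_difference = logged[3] - logged[2]


def probability_to_odds(p):
    """Convert a probability (0 <= p < 1) to odds."""
    return p / (1 - p)


def odds_to_probability(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)


risk = 0.20                        # a 1-in-5 chance of the outcome
odds = probability_to_odds(risk)   # odds of 1 to 4

print("logged values:", [round(v, 2) for v in logged])
print("ratio:", round(ratio, 3), "exp(log difference):", round(math.exp(log_difference), 3))
print("risk:", risk, "odds:", round(odds, 3))
```

If you can follow why the ratio and the exponentiated log difference agree, and why a risk of 1 in 5 corresponds to odds of 1 to 4, you already have most of the arithmetic needed to read the output of a log-linear or logistic model.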

To be clear, I don’t mean to minimize what all of that represents for many people, particularly those with any math anxiety. But I do think it’s impossible to be a good-enough applied statistician without being numerate. For whatever it’s worth, I never got past Calculus I in college, I’ve never even taken a formal class on linear algebra, and my vision goes blurry anytime I see a page with more than one equation on it. That said, I do try to work on my math skills and numeracy, at least a bit, every day. Thankfully we live in a world full of thoughtful experts trying to help the rest of us mere mortals become more numerate.

In praise of “stupid” questions, by Eugenia Cheng (“We could let go of needing to teach people the rigorous math behind the statistics.”)

Brilliant: An app full of interactive lessons in math, logic, probability, statistics, and other fun stuff. I use this every day.

Statistics 110: Probability

Khan Academy

The Joy of X and Infinite Powers

3Blue1Brown

Essential Mathematics for Political and Social Research

Math3ma

Numberphile

For those of you who worry about the math, I have more good news. Statistics is more about ideas and thought experiments than you might realize. The foundations of statistics are epistemological. It’s about questions like: How do we know what we think we know? What do words like evidence and inference mean? How does a statistical hypothesis differ from a scientific one? How are they related to decision making? How can any of this help us explore and describe the natural world? How can it help us improve our own?

These questions are just as much a part of statistics as understanding how to calculate a z-score. More so even. Unfortunately, these deeper roots of statistics are often ignored in our training, where we tend to focus on what to do and spend very little time on why we do it. It’s important that you remedy this. And for many of you, the “whys?” are the fun part, which is even more reason to engage with the philosophy and history of statistics. And as a bonus you’ll start to understand why statisticians never seem to be able to give you a straight answer, or even agree amongst themselves.

Understanding Psychology as a Science

Improving your statistical inferences

Probability Theory: The Logic of Science

You May Believe You Are a Bayesian But You Are Probably Wrong

Dicing with death

Deborah Mayo’s books and blog

Now let’s put the “applied” in applied statistics, and for that you need tools. The tools most of us learned in the beginning were a multitude of statistical tests and procedures. However, they were so numerous and arcane (since you lacked training in the fundamental ideas that tie them all together) that you literally needed a flowchart to tell you which one to use. But what if you could replace all those tests and procedures with a single tool? Would that excite you? Well you can! It’s called a Linear Model, and as the links below will make clear, just about every statistical test under the sun can be applied in the form of a linear model. So just skip the tests and move right into learning about linear regression and build from there.

Common statistical tests are linear models (or: how to teach stats)

IJALM (It’s Just A Linear Model) - Daniela Witten

Statistical Rethinking
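The claim that common tests are linear models is easy to check for yourself. What follows is a minimal sketch, not anything taken from the resources above: it uses made-up data for two groups, computes the classic equal-variance two-sample t statistic by hand, then fits a straight line to the same data with group membership coded as 0 or 1. The function names `pooled_t_statistic` and `simple_regression` are hypothetical helpers written for this illustration, and Python's standard library is used only so the example runs anywhere.

```python
import math
import statistics

# Made-up outcome values for two groups (hypothetical, for illustration only).
control = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7]
treated = [6.2, 5.9, 6.8, 6.1, 7.0, 6.4]


def pooled_t_statistic(a, b):
    """Equal-variance two-sample t statistic for mean(b) - mean(a)."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(b) - statistics.mean(a)) / se


def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, t statistic for the slope).
    """
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(residual_ss / (n - 2) / s_xx)
    return intercept, slope, slope / se_slope


# Code group membership as a 0/1 indicator and stack the outcomes.
x = [0] * len(control) + [1] * len(treated)
y = control + treated

t_test = pooled_t_statistic(control, treated)
intercept, slope, t_slope = simple_regression(x, y)

print(f"difference in means: {statistics.mean(treated) - statistics.mean(control):.4f}")
print(f"regression slope:    {slope:.4f}")
print(f"t from t-test:       {t_test:.4f}")
print(f"t from regression:   {t_slope:.4f}")
```

The intercept is the control-group mean, the slope is the difference in means, and the two t statistics agree. In practice you would let R (or any other statistics package) fit the model for you; the point of doing it by hand once is to see that the "test" and the "model" are the same calculation.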

Finally, to be a good applied statistician, you should also invest time in statistical programming. Understanding what model to use and how to correctly interpret it are vital for avoiding research waste, but not completely sufficient. You also need to be able to safely use the model and the data it’s fit to. I’ve written about this in much more detail here, but in a nutshell, I encourage burgeoning applied statisticians working in medicine and health to learn about R and RStudio. While Stata is also great, and some of you might be forced to use SAS with a gun to your head (SPSS users should just take the bullet), R and RStudio will in my opinion provide you with the most complete platform for applying statistics safely, reproducibly, and efficiently.

RStudio Primers

More RStudio training

swirl

UCC PG 6030: Reproducible Research Practices using R

Evan

In 1962, my father made a move from UCLA to the University of Oregon. They had Sputnik money to expand their Statistics and Sociology departments, particularly grads and postgrads. He was there for more than a decade. When he returned from sabbatical about 1973, he was told he could no longer give any grade lower than a C in his Statistical Methods course (which washed out more sociology students than any other single class). The Sputnik money had apparently run out.

He quit in protest, of course.

I tell a small group of people that being raised by statisticians was like being raised by very numerate wolves.

I had no idea what my father had really done for a living (aside from teaching) until I took a statistics refresher in grad school. It was my first classroom with wifi, and being bored, I searched for something the professor mentioned in passing and found “Ecological Correlations and the Behavior of Individuals” by W. S. Robinson — my father. He proved the ecological fallacy, although there were inconsequential math errors in the paper and he never used the term “ecological fallacy”.

James Meyer

Another excellent resource which focuses primarily on medical and biomedical research is https://discourse.datamethods.org/
