No, you can't explain what a p-value is with one sentence (Parts I, II)
But I'm going to try and do it in as few as possible. #fml
Part I
Sometimes math, science, and technology progress at breakneck speed. Other times they inch forward in spurts and stops. And sometimes they hit a brick wall, when a given problem sits unresolved for decades, or even centuries. Examples in math and computer science abound. This essay is about one of the most insidious of these open challenges - how to explain what a p-value means in a single sentence.
It seems like every day brings a new example of someone mangling the definition of a p-value, usually by saying something like "it's the probability that this result was due to random chance alone." Other times the mangling is more violent. For example, one of the most popular videos about p-values on YouTube (~500k views) defines it thusly:
If you didn’t recoil in disgust after reading that, don’t worry, I’ll explain why it’s wrong later. Then we can share this pain together. But first I want to assure you that you’re in good company if you didn’t catch the flaw(s) in that definition. In fact, if you want to terrify your friendly neighborhood statistician, simply invite them for a coffee, gain their trust with a few pleasantries, and then ambush them by asking, “Hey, by the way, can you explain what a p-value is? I just never really got it…” We are all terrified of getting this wrong, mainly because we often do.
Perhaps the most telling example of a smart, educated expert mixing up the definition of a p-value was in Hannah Fry’s excellent essay What Statistics Can and Can’t Tell Us About Ourselves, published in the New Yorker almost a year ago today. It is something that anyone who is interested in statistics should read, but when you get to the end you will see a small note:
I’m not going to stop here to parse what was originally written — we’ll get to all of that shortly. The point is that here we have an objectively brilliant person with an excellent understanding of statistics, someone who is rightly lauded for their ability to explain complicated mathematical concepts to the rest of us, writing in one of the world’s most famous and respected periodicals, in an effort to explain and clarify the role of statistics for larger society. And they got the p-value definition wrong. Like I said — good company.
Fortunately I can now share the exciting news that I have finally solved the formerly intractable problem of explaining what a p-value is in one sentence. Statisticians can stop living in fear of saying the wrong thing, and scientists can start to move towards a true understanding of the concept.
The solution to this problem is to explain p-values using more than one sentence. Amazing. I know.
Joking aside, of all the things we get wrong about statistics education, this apparent obsession with explaining p-values in a single sentence is one of the most harmful. You may have noticed that a lot has been written recently about p-values and their value to science (or perhaps their lack thereof), and one of the most common arguments against the use of p-values is that they are hard to understand. However, what these people are really pointing out is that p-values are hard to understand if you are only allowed one sentence to explain them. I’ve never understood the logic of this argument, despite how often it’s made by people I know, respect, and even idolize. It’s like asking whether the fuel injection system in my minivan is useful or not based on whether I can explain it to you in 30 or fewer words.
This never-ending quest for a satisfying single-sentence definition of p-values is telling though, as it both reflects and reinforces our completely dysfunctional approach to statistics (at least in the medical and public health sciences that I am most familiar with). Statistics is often taught on a superficial level, as if scientists can’t, or shouldn’t, be bothered to understand the actual meaning of things like p-values and the reasons for their existence. Do scientists not often generate and interpret statistical results? Are scientists not smart enough to understand the p-values they use? That doesn’t seem right. Most scientists are smart, at least smart enough to understand p-values; and many use them on an almost daily basis, because statistics is not just some niche or tangential scientific topic — statistics is fundamental to the entire scientific enterprise, and the failure of scientists to understand it has led to real harms. So why do we insist on dumbing so much of it down?
So to me it seems obvious that we need to start trying to teach people what p-values are by any means necessary, even if that means using two or more sentences (gasp!). This isn’t just so people can use and interpret them correctly — it’s also so they can properly evaluate the alternatives, which I find just as tricky to understand (and certainly as tricky to employ). In other words, even if you want to get rid of p-values, I would hope that you want to do it through education, and not just by taking advantage of people’s ignorance.
Part II
What follows is an attempt to explain what p-values are. To be clear, this is not an attempt to argue whether we should use them. We can have those discussions later on. The first goal here is to help you gain a good understanding of what p-values are so we can have more inclusive, productive conversations about statistics going forward.
The Big picture
To understand p-values (and other statistical tools and concepts), we must first talk about science, which is ultimately about trying to make sense of the empirical world that can be directly observed or experienced. Scientists thus tend to obsessively measure things, such as the velocity of a planet on its path around the sun, or the percentage of scientists in the world who can correctly define a p-value in one sentence.
A challenge for scientific progress is that measurements are never perfect. They are always estimates. Sometimes estimates are imperfect because our measurement instruments are limited in meaningful ways. Other times, estimates are imperfect because the measurements we collected are just a small subset of all possible, relevant measurements. The science of statistics is about evaluating just how imperfect, or uncertain, our estimates really are.
This is a simplified description of the field's overarching purpose, though one I think most statisticians would agree with. How we might do this, however, is fiercely debated. This contentiousness is evidenced by the multiple schools of statistical thinking that emerged from efforts to solve problems in fields as disparate as gambling, astronomy, agriculture, and more. I don’t want to get side-tracked describing and contrasting these schools of thought. At least not yet. For now I just want to point out that they can all co-exist because statistics is just kind of made up.
For example, take the idea of probability, which we use a lot in statistics. Probabilities have an indisputable mathematical definition and a set of rules that govern how we can manipulate them. But what is the nature of probability? What does it mean? I can pretty much guarantee that whatever answer you might come up with, I can find someone just as smart as you who disagrees. That’s what I mean when I say statistics is made up. Its foundations rest on ideas and thought experiments. Statistics uses math, but it isn’t math. Statistics is philosophical. It is epistemological. This is important for you to understand as a learner who might find p-values and statistics confusing. Of course you do! It’s all made up, so go easy on yourself. This is why I believe that the value of different statistical tools can only be realized through their application to specific problems, and in particular through how they influence decision making.
With this in mind, I now want to focus on the specific problem of measuring a cause-effect relationship using a randomized controlled trial (RCT), and how we might use p-values to think about the uncertainty of the resulting estimate. Importantly, I have chosen this context because it's the one for which p-values make the most intuitive sense.
The Estimate
Say you are concerned about researchers' poor understanding of p-values, and develop an educational intervention to remedy this problem. Then you decide to run an RCT to see if it actually works. Researchers are recruited and randomized into two arms, one of which receives the educational materials on p-values (the active arm) while the other gets similar material but about dog grooming (the control arm).
At the end of the study, you give the researchers an exam about p-values, where higher scores reflect better understanding, and compare the mean scores in both groups. This difference is your estimate for the average treatment effect of the intervention (why this is the case is a discussion for another day). So the trial statistician calculates this difference in mean scores and informs you that the average score of participants allocated to the active arm was 10 points higher than the average score of participants allocated to the control arm.
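If it helps to see the arithmetic, here is a minimal sketch in Python; the scores are made-up numbers, chosen only to produce the 10-point difference described above:

```python
from statistics import mean

# Hypothetical end-of-trial exam scores (invented purely for illustration).
active_scores = [78, 84, 90, 72, 88, 68]   # got the p-value materials
control_scores = [70, 66, 75, 62, 80, 67]  # got the dog-grooming materials

# The estimate of the average treatment effect: the difference in mean scores.
estimate = mean(active_scores) - mean(control_scores)
print(estimate)  # 10
```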
You pause briefly as the result sinks in, and then erupt in celebration. Eureka! Never in your wildest dreams did you expect such a wonderfully positive result. In fact, there’s never been an educational intervention that did such a good job of improving understanding of p-values. 10 points! You start to plan your TED Talk.
That’s when you feel a gentle tap on your shoulder. It’s the statistician again. Why aren’t they jumping for joy? They aren’t even smiling. Their mouth starts to move: “However, the observed difference was not signifi…” The world blurs. Then it goes black.
The recorded cause of death?
“P > 0.05”
The Sampling distribution
So what is this p-value and where did it come from? It came from a sampling distribution, so let's start there.
Remember when I said there were different schools of statistical thinking? Well, a number of them fall under an umbrella that we call frequentist statistics, which includes tools and concepts like p-values, confidence intervals, type 1 error, and power - and all of these things are based on the idea of a sampling distribution.
Now imagine finding a shiny new coin on the sidewalk. You pick it up and flip it 10 times, which results in 4 heads and 6 tails. Then you flip the coin again 10 more times, this time getting 5 heads and 5 tails. Are you surprised? Probably not.
Let's now define “an experiment” as flipping the coin 10 times, and an “outcome” as the number of heads observed in each experiment. Now imagine that you could repeat this experiment an infinite number of times. If you could actually do this, you would expect to observe some distribution of outcomes across these hypothetical, repeated experiments. Same coin. Same measurement procedure (flip 10 times and count the heads). But different outcomes across experiments, creating a hypothetical distribution of said outcomes.
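If the idea of a hypothetical distribution of outcomes feels abstract, here is a small Python simulation sketch; we obviously can't repeat the experiment infinitely often, so 100,000 repetitions (an arbitrary choice on my part) stand in for "a lot":

```python
import random
from collections import Counter

def experiment(n_flips=10):
    """One experiment: flip a fair coin n_flips times and count the heads."""
    return sum(random.random() < 0.5 for _ in range(n_flips))

# Approximate the sampling distribution of the outcome by repeating
# the experiment many times and tallying the results.
n_repeats = 100_000
outcomes = Counter(experiment() for _ in range(n_repeats))

for heads in sorted(outcomes):
    print(f"{heads:2d} heads: {outcomes[heads] / n_repeats:.3f}")
```

The printed proportions are just a simulated version of that hypothetical distribution: same coin, same measurement procedure, different outcomes across repeated experiments.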
Now let's extend this kind of thinking to our RCT of your educational intervention. Say you enrolled six participants and randomized them into two groups of three people, but accidentally and unknowingly gave both groups the control arm intervention (oops!). Not realizing your mistake, you then went on to give the participants the p-value exam. What should you expect to find? It depends, but solely on how the participants happened to fall into the two groups when you randomized them. This is because we should still expect variability in exam scores, despite the lack of an "active" intervention, since some people will already know more about p-values than others, and there is usually some chance involved whenever we take a test (e.g. making a few lucky or unlucky guesses). We can thus imagine that the mean score in each group, and therefore the difference in mean scores comparing the groups, would also vary across hypothetical replications of the same study, purely as a function of who happened to be randomized into each group. Same people. Same measurement procedure (randomize into arms, give exam, calculate between group difference in mean exam scores). Different estimates across hypothetically repeated experiments.
Let's make this more concrete. Say the exam scores in the active arm were 46, 49, and 55, resulting in an average score of 50 in that group; and the scores in the control group were 30, 45, and 45, for an average of 40. That would make for a between-arm difference in mean scores of +10, exactly the same result that got you so excited before your untimely death. Again, in this hypothetical universe, this result had nothing to do with any intervention on our part. It was purely the result of chance. There is another, equally plausible universe where the randomization could have reversed these groups, and the between-arm difference in mean scores would have been -10 points (a disaster!); and 18 other equally likely randomizations that would lead to estimates falling between these two extremes (illustrated below).
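If you'd like to check this for yourself, here is a minimal Python sketch (mine, purely for illustration) that enumerates all 20 possible randomizations and the estimate each one would produce:

```python
from itertools import combinations
from statistics import mean

# The six exam scores from the example above (remember: in this hypothetical
# universe both groups actually received the control materials).
scores = [46, 49, 55, 30, 45, 45]

# Every possible way of randomizing six people into an "active" group of three;
# the remaining three form the "control" group. There are C(6, 3) = 20 splits.
estimates = []
for active in combinations(range(6), 3):
    active_scores = [scores[i] for i in active]
    control_scores = [scores[i] for i in range(6) if i not in active]
    estimates.append(mean(active_scores) - mean(control_scores))

# The distribution of estimates under "no effect", from -10 up to +10.
for est in sorted(estimates):
    print(f"{est:+.2f}")
```

Running it shows one randomization giving +10, one giving -10, and the other 18 falling somewhere in between.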
So, given what you now know about the distribution of possible estimates in this contrived situation, where there wasn’t any effect of the intervention at play, how does that make you feel about that initial finding? Does it seem so spectacular now? Perhaps not. After all, knowing that there was no actual effort to improve people’s understanding of p-values, we would still expect to see an estimate of +10 points, or very nearly that, in 3 of the 20 possible randomizations. That’s 15% of the time. Some people would call this frequency of successes (3, in the numerator) relative to all possible experiments (20, in the denominator) a probability. Let’s call them frequentists.
So let's pull this all together. We have an estimate, which is calculated from the actual data we have observed in our study. We have a sampling distribution, which in this example is the probability distribution of all the estimates we would expect to see if there was no intervention (or if the intervention didn't work) and we were able to replicate our study an infinite number of times (you can also envision other sampling distributions that don’t assume the intervention doesn’t work, which we will discuss later). Finally, we compare our estimate to that sampling distribution to extract the probability of observing an estimate as extreme as (or more extreme than) ours if (this is a big if) the assumptions we used to imagine that sampling distribution hold. That probability is the p-value.
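And here is that final step as code, continuing the sketch from above; note that it uses the usual "as large as or larger than the observed estimate" convention, and the one-sided versus two-sided choice shown here is mine, for illustration only:

```python
from itertools import combinations
from statistics import mean

scores = [46, 49, 55, 30, 45, 45]   # the six exam scores from before
observed = 10                       # the estimate we actually got

# The sampling distribution assuming the intervention does nothing:
# the estimate produced by every one of the 20 possible randomizations.
estimates = [
    mean([scores[i] for i in split]) - mean([scores[i] for i in range(6) if i not in split])
    for split in combinations(range(6), 3)
]

# One-sided p-value: how often the "no effect" estimates are at least as large as ours.
p_one_sided = sum(est >= observed for est in estimates) / len(estimates)

# Two-sided p-value: how often they are at least as far from zero as ours.
p_two_sided = sum(abs(est) >= abs(observed) for est in estimates) / len(estimates)

print(p_one_sided)  # 0.05 (1 of 20 randomizations)
print(p_two_sided)  # 0.10 (2 of 20 randomizations)
```

In a real trial you would rarely enumerate the randomizations by hand like this, but the logic of the comparison is the same.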
But what does it mean!? That depends on whom you ask, and we’ll get to some of those explanations below. For now, however, I want to offer the main way I’ve come to see p-values in the context of an RCT. When we get to the end of a trial and see an estimate of the average treatment effect suggesting the intervention is beneficial (relative to control), we should resist the urge to do back flips. We should instead be good scientists and first consider that what we’ve observed might be plausibly explained in other ways. The most obvious alternative explanation in an RCT is that what we’ve observed is just the result of sampling error. So we work out what the sampling distribution for our estimate would be assuming the intervention didn’t work, and then compare the estimate we actually observed to that distribution. If that estimate isn’t unusual with respect to that sampling distribution, which is to say that our estimate could have been plausibly generated by the sampling error that comes along with randomization, then we probably need to stop with the back flips. However, if estimates like the one we observed are rare under that sampling distribution (i.e. if there is little concordance between the two), then perhaps we can safely rule out randomization as the main explanation for our estimate and allow ourselves to conclude the intervention actually did something after all.
Importantly, the p-value doesn’t tell you what the effect was. It doesn’t even tell you if the effect is big enough to care about. It just gives you a sense of whether your result could have been plausibly explained by randomization alone. It’s a useful tool, but just one of many, for making sense of uncertainty.
More to follow. One day. Perhaps.