Out of balance
A perspective on covariate adjustment in randomized controlled trials in medicine.
The randomized controlled trial is a valuable tool for understanding the effects of interventions. Like any study design, RCTs have limitations, and they must be properly designed, conducted, and analyzed to yield useful insights. While those actually involved in RCTs usually seem to understand this all too well, others seem to think they don’t, as I am regularly reminded on social media.
These people are often just well-intentioned researchers working in areas where RCTs aren’t really possible, and so haven’t had the opportunity or need to understand them. So while it’s annoying when someone misexplains to me some well-understood aspect of RCTs for the umpteenth time, it’s fairly harmless behavior.
What we can’t ignore, however, is when misconceptions about RCTs are published in medical and scientific journals. To be clear, I do not have a problem with valid critiques of RCTs, nor with honest discussions of their limitations. Quite the opposite in fact — like many statisticians and trialists, I love talking about that stuff. Just ask my barber. But misconceptions dressed in authority are a real problem, since they can be thoughtlessly used by others to justify their otherwise unfounded distrust of trials. And what will happen when we get lots of these misconceptions echoing through the scientific literature?
Not good.
So with the naive hope that sticking my finger in the dam could actually make a difference, I want to try to clarify a common misconception about RCTs, and explain how it contributes to sub-optimal trial design and analysis.
The misconception I want to discuss is the claim that we use randomization to balance confounders, a claim that has been published about a million times, by experts and novices alike. It’s so common in fact that you might think I’ve taken leave of my senses to suggest that it’s not true. But it isn’t. Not only is the above statement wrong, it’s wrong twice. We don’t use randomization to balance covariates in general, and we certainly don’t use it to balance confounders.
So maybe at this point you are saying, “Of course we use randomization to balance covariates, you daffy fool. It’s right there in CONSORT!”
“First, when properly implemented, it [randomization] eliminates selection bias, balancing both known and unknown prognostic factors, in the assignment of treatments.”
So how is this wrong?
When we use an RCT to evaluate an intervention, we do so with respect to one or more endpoints (or outcomes) that will be measured in the future, after the period of intervention. It could be blood pressure, death, quality of life, etc.
We want to understand the causal effect of the intervention on that outcome, but this is tricky. That’s because to really understand the effect of the intervention, we would need to give it to someone and measure the outcome to see what happened. Then we would need to reset the universe back to the exact point when the intervention was given, withhold it this time, and see what happened when they were left untreated. The difference in the outcomes between the two scenarios would be our estimate of the causal effect of the intervention.
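In potential-outcomes notation (my shorthand, not something you need to memorize), that individual causal effect is

$$\tau_i = Y_i(1) - Y_i(0),$$

where $Y_i(1)$ is person $i$’s outcome with the intervention and $Y_i(0)$ is their outcome without it. Since we can never observe both for the same person, $\tau_i$ itself is forever out of reach.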
This is clearly a fantasy, but hope is not lost. Thankfully we can mimic this counterfactual situation by randomizing people into groups, and since we are now talking about groups, we have to start talking about distributions of future outcomes.
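To make that concrete (still my notation): if $A$ is the randomized allocation, randomization makes $A$ independent of the potential outcomes, so

$$E[Y \mid A = 1] - E[Y \mid A = 0] = E[Y(1)] - E[Y(0)],$$

i.e. the humble difference in observed group means estimates the average causal effect without bias.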
So imagine that we have a group of patients that will soon enroll in an RCT, and we send them all to a clairvoyant who tells each of them what their future outcome will be if their life just proceeds as usual (i.e. if they never actually enroll in the trial and thus don’t receive any intervention). Everyone writes this information down on a piece of paper, folds it up, and hides it away.
Some people might have a poor future outcome, while others have a better outlook. So while you can’t see any individual’s future (unlike the clairvoyant), you do know there will be a distribution of outcomes among the study participants; and depending on your clinical experience with the outcome in these kinds of patients, you might be able to make some educated guesses about the nature of that distribution, such as its average and variance. (In fact, if you can’t make an educated guess about the distribution of the outcome in your clinical population, I would argue that you aren’t qualified to run a trial…but I digress.)
Now let’s randomize these participants into two groups, intervene in one and not the other. Then we measure the outcome at the end, and compare the two distributions, finding that they are different (e.g. the mean outcome in the intervention group is substantially better than that of the control). Now you have to answer the question, “did the intervention work?”
Well, maybe. So let’s cheat reality and ask all of the patients to pull out the pieces of paper with their futures written on them. You carefully write down the data and plot the distributions for the two groups, finding that they completely overlap — they are, for all intents and purposes, the same — the groups are exchangeable — they had the same baseline risk when the study began.
So when you intervened in one group but not the other, the distributions of the outcomes were different. But had you done nothing to either group, their distributions would have been the same (according to the clairvoyant). The intervention worked!
So what exactly did randomization do for us here? Did it guarantee that the two groups would have the same distribution of future outcomes? No, there is no such guarantee. However, we know that the chances of there being a substantial difference between them will drop as the sample size increases. So what randomization allows us to do is make probabilistic statements about the likely similarity of the two groups, with respect to the outcome.
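If you like, you can check this yourself with a quick simulation. Here is a minimal sketch in Python (my own toy, with a made-up outcome distribution standing in for the clairvoyant’s notes):

```python
import numpy as np

rng = np.random.default_rng(42)

def typical_baseline_gap(n, reps=5000):
    """Average |difference in mean untreated outcome| between two random arms."""
    gaps = np.empty(reps)
    for r in range(reps):
        y0 = rng.normal(size=n)                           # untreated future outcomes
        arm = rng.permutation(np.repeat([0, 1], n // 2))  # 1:1 randomization
        gaps[r] = abs(y0[arm == 1].mean() - y0[arm == 0].mean())
    return gaps.mean()

for n in (20, 80, 320, 1280):
    print(n, round(typical_baseline_gap(n), 3))
# The typical gap shrinks like 1/sqrt(n): similarity is probabilistic, not guaranteed.
```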
So back to our misconception. Please note that at no time have I mentioned the word covariates. The only thing I care about is the distribution of future outcomes (which is what people really mean when they say balance). To drive the point home, let’s say that after we saw the two randomized groups shared the same distribution of future outcomes (because we looked at their notes from the clairvoyant), we noticed that one group had all the people with a strong prognostic indicator for the outcome (e.g. family history of hypertension in a trial where blood pressure was the primary endpoint). Should we care? No — conditional on the knowledge that the two groups have the same distribution of future outcomes, it makes absolutely no difference if there are other dissimilarities between the groups. Let this sink in.
At this point you might say, “But we can’t know whether the groups are exchangeable — we don’t have a clairvoyant!” And of course you would be right. But returning to the point I made before — the goal of randomization isn’t to create groups that are certainly the same. It’s to help us make probabilistic statements about how similar they might be, with respect to the only thing that matters, the baseline risk (i.e. the distribution of their future outcomes).
Randomization of course does the same thing for all the other characteristics of the patients, measured and unmeasured. That is what the CONSORT statement above is saying — so it is technically correct after all. My objection is that it completely focuses on the importance of “balancing…prognostic factors” and doesn’t mention the outcome at all.
This is completely counter to how we actually design trials, which we do with the outcome in mind. If the outcome is highly variable, then you know to run a larger trial and/or use a less variable outcome, in order to drive down the chance that there will be a difference in the baseline risk for the trial arms that affects your interpretation of the trial’s results. Importantly, the sample size you choose won’t necessarily allow for similarly palatable probabilistic statements about differences in the distributions of other covariates (e.g. those that are even noisier than your outcome), but that’s OK, for the reasons we’ve just discussed.
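You can see this in the standard sample size formula for comparing two means (a textbook result, not mine): the required per-arm sample size

$$n \approx \frac{2\sigma^2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{\Delta^2}$$

depends on the patients only through the outcome variance $\sigma^2$ (with $\Delta$ the target difference to detect). No covariate appears anywhere in it.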
So what is my complaint about referring to the covariates as confounders? It’s simple really. They aren’t confounders!
First, confounding is a type of bias, and biases are things that lead to incorrect estimates in a systematic fashion — so that if you repeated your entire experimental procedure, a given bias should be there every time. This is contrasted with an error, which is another way for our estimate to be wrong — but the effect of error can vary from one replication to the next. So if I put my finger on the scale and systematically steer the heavy smokers into one of my trial arms for a study of lung function, then that’s a bias. However, if I randomize, and I happen to get a disproportionate number of them in one arm, then that’s just dumb luck — an error whose nature will change from one replication to another.
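Here is a toy simulation of that exact story (assumed numbers, nothing more): a treatment with no effect, lung function lower among heavy smokers, comparing the rigged allocation to an honest one:

```python
import numpy as np

rng = np.random.default_rng(7)

def one_trial(n=200, rig=False):
    smoker = rng.random(n) < 0.3                    # ~30% heavy smokers
    y = rng.normal(100, 10, size=n) - 15 * smoker   # smokers do worse; no true treatment effect
    arm = rng.integers(0, 2, size=n)                # simple randomization
    if rig:
        arm[smoker] = 0                             # finger on the scale: smokers -> control
    return y[arm == 1].mean() - y[arm == 0].mean()

rigged = [one_trial(rig=True) for _ in range(2000)]
honest = [one_trial() for _ in range(2000)]
print(round(np.mean(rigged), 2), round(np.mean(honest), 2))  # ~7 vs ~0: the rig biases every replication
print(round(np.std(rigged), 2), round(np.std(honest), 2))    # both vary run to run: that is error
```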
I also cringe at referring to covariates as confounders because it’s much better to think of the problem, in general, as one of confounding *(see the note at the end) — something that in the context of an RCT has already been eliminated by the randomization (yes…assuming we did everything correctly). We can of course explain this with a sample DAG.
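I can’t draw a proper figure here, but a minimal text sketch of the kind of DAG I mean (my rendering, with L a baseline covariate, A the allocation, and Y the outcome) looks like this:

```
L (covariate)  --->  Y (outcome)
A (allocation) --->  Y (outcome)
L -/-> A   (randomization deletes this arrow)
```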
Since every arrow should reflect a causal relationship, it’s not possible for there to be an arrow pointing from a covariate to the allocation, because the allocation is made at random. With no arrow pointing to the exposure, there can be no unblocked backdoor path, and thus no confounding. Voilà.
This is all nicely explained in one of my all-time favorite papers in Epidemiology, A Structural Approach to Selection Bias, by Hernán, Hernández-Díaz and Robins; and another classic, Causal diagrams for epidemiologic research (also in Epidemiology), by Greenland, Pearl and Robins. However, despite how critically important these papers are for understanding why covariate imbalances have nothing whatsoever to do with confounding, both Hernán and Greenland seem to be ok with the term random confounding, which I think should be immediately submitted to the oxymoron hall of fame. (Note: Sander Greenland kindly offered some useful papers on this topic here, in support of the concept of random confounding - as with everything in statistics, it’s a bit more complicated than I’ve allowed for here.)
Why does any of this matter?
To recap, randomization is about being able to make probabilistic statements about the likely distributions of future outcomes (or baseline risk) in our study arms — and when there are non-trivial differences, this is an error, not a bias, and certainly not confounding. So why should you care? Am I just being a pedantic dullard, hung up on a few words?
One meaningful problem that follows from this misconception is that people start to believe in a boogeyman called infinite confounders, which leads them to mistakenly discount the value of randomization. It’s the idea that, even with randomization, there could always be some strong prognostic factor that we didn’t measure, distributed disproportionately across the trial arms, wrecking the validity of our results. Taken to the extreme, you even have people arguing that the trial arms should be very tightly “balanced” for every possible characteristic to make valid inferences about an outcome (e.g. this obscene paper in PLOS One).
What the worrywarts don’t seem to understand is that these imbalances only matter to the degree they explain variation in the outcome; that they can and do offset each other; and that their ability to affect our interpretation of the results is completely bounded by the variance of the outcome (REF), which you already took into account when designing the study and setting the sample size.
The other consequence of this misconception, and the more important one in my view, is that it leads people to misuse covariate information. The wrong way to use covariates is to look for treatment arm “imbalances” in baseline covariates (i.e. those measured prior to randomization), usually with a simple statistical test where p < 0.05 equals an “important imbalance”, and then include the covariates thus identified in the model used to estimate the treatment effect.
Despite plenty of expert advice against it, this is a painfully common procedure, and I think it has a lot to do with the “balancing confounders” misconception we are discussing.
The crux of the problem is that there are different ways to use a linear model. In epidemiology, where concerns about confounding are supreme, we use linear models to adjust for confounding by including the appropriately selected covariates, resulting in conditional probability statements that help us draw causal inferences (you can argue amongst yourselves how useful this is). Returning to the DAG sketched above, we know that a covariate can lead to confounding when it is a cause of both the exposure and the outcome — and since we know the allocation in an RCT can’t be caused by a covariate, people fall back to a weaker condition, that the covariate is merely associated with the exposure, which is where we get people talking about “random confounding”.
So if you view randomization as a tool for balancing confounders, then it might make sense to select those that are associated with allocation and adjust for them. However, there are a few problems that emerge from this. The first is that people lose sight of the outcome — we see people running tests for associations between covariates and the treatment arm all the time, but only rarely between covariates and the outcome. This is clearly thoughtless, especially since adjusting for covariates that aren’t prognostic for the outcome will do nothing to improve inferences, and can actually waste information, leading to less efficient estimates.
Another problem is that “important imbalances” aren’t going to be consistently identified using bright-line statistical tests. There could easily be “non-significant” differences that would be useful to adjust for, and significant ones that aren’t (as I explained above — because they aren’t related to the outcome).
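A quick simulation makes the point (again, my own toy): the baseline test never looks at the outcome, so it flags prognostic and non-prognostic covariates at exactly the same rate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 200, 4000
hits_noise = hits_prog = 0
for _ in range(reps):
    arm = rng.permutation(np.repeat([0, 1], n // 2))
    noise = rng.normal(size=n)   # pure noise, unrelated to the outcome
    prog = rng.normal(size=n)    # pretend this one strongly drives the outcome
    hits_noise += stats.ttest_ind(noise[arm == 1], noise[arm == 0]).pvalue < 0.05
    hits_prog += stats.ttest_ind(prog[arm == 1], prog[arm == 0]).pvalue < 0.05
print(hits_noise / reps, hits_prog / reps)  # both ~0.05: the test is blind to prognostic value
```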
And finally, the biggest problem with this approach is that it requires post-data decision making to specify your analysis model. In other words, you can’t specify your model until you have seen the data to identify the imbalances. This allows for the possibility that an investigator might pop different covariates in and out of their model until they get the answer they are looking for, and then justify those choices after the fact. For anyone who has been paying attention to concerns about reproducibility and research integrity, the problems are obvious: sharpshooter fallacies, p-hacking, and forking paths — i.e. it is very bad.
What we should do instead is pre-specify our analysis, which means choosing covariates to adjust for before we even collect the data. This means that we actually have to understand our outcome (else we shouldn’t be starting the trial) — and understand it well enough to know what the important, measurable prognostic factors are. And when we understand that randomization isn’t about “imbalanced confounders” and bias, but rather about taming error (i.e. otherwise unexplained variance) in the outcome, we can see the linear model from a more Fisherian point of view — as a tool for decomposing outcome variance. Thus the inclusion of the pre-specified prognostic factors in the linear model absorbs some of the variance in the outcome, and this reduces the variance of the estimator of the treatment effect — and it does this even if there is no imbalance across treatment arms (something I tried to demonstrate here). Huzzah!
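To make that concrete, here is a minimal sketch (my own simulation, with assumed effect sizes) of the variance reduction you get from adjusting for a pre-specified prognostic covariate, in arms that are balanced by construction:

```python
import numpy as np

rng = np.random.default_rng(11)

def one_trial(n=200, effect=0.5):
    x = rng.normal(size=n)                                 # pre-specified prognostic covariate
    arm = rng.permutation(np.repeat([0.0, 1.0], n // 2))   # exactly balanced 1:1 allocation
    y = effect * arm + 2.0 * x + rng.normal(size=n)        # x explains much of the outcome variance
    unadjusted = y[arm == 1].mean() - y[arm == 0].mean()   # simple difference in means
    X = np.column_stack([np.ones(n), arm, x])              # ANCOVA: y ~ arm + x
    adjusted = np.linalg.lstsq(X, y, rcond=None)[0][1]     # coefficient on arm
    return unadjusted, adjusted

results = np.array([one_trial() for _ in range(4000)])
print(results.mean(axis=0))  # both ~0.5: neither estimator is biased
print(results.std(axis=0))   # the adjusted estimator is much less variable
```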
So if you’ve made it this far, thank you for listening. As ever, virtually nothing I’ve written is in any way original, and I’ve linked some resources below, mostly from Stephen Senn. If you have other suggestions for useful reading on the topic, please comment and I’ll add them to the list.
If you work in trials and haven’t quite gotten to grips with what I have tried to communicate — don’t worry, it’s not your fault! I’ve been trying to wrap my head around this stuff for at least a few years now, coming from a background in epidemiology. So keep working at it, keep reading, and if you have any questions that you think I can help with, you can find me on Twitter.
Stephen Senn. Seven myths of randomisation in clinical trials. Statistics in Medicine.
Stephen Senn. Covariate imbalance and random allocation in clinical trials. Statistics in Medicine.
Stephen Senn. Randomization, ratios and rationality: rescuing the randomized clinical trial from its critics. ErrorStatistics.com
Zad Chow. Myth: Covariates Need to Be Balanced in RCTs. LessLikely.com
Ewout Steyerberg. RCT Analyses With Covariate Adjustment.
Frank Harrell. Statistical Thinking. https://www.fharrell.com/post/covadj/

*This follows from a second-hand quote of Jamie Robins, who, when asked what a confounder was, answered “Whatever you need to adjust for to get the right answer.”