The trial statistician and the clinical investigator took a step back to admire their creation. A two-arm parallel randomized controlled clinical trial to evaluate the efficacy of a promising new drug, Amazaflux.
“The outcome is a wonder,” said the statistician. “It is relevant to the scientific question at hand, and the patients say it’s what matters most to them. Plus it can be accurately measured in a reasonable time frame.”
“Yes. And behold, our population, perfectly defined, in light of our aims,” said the CI.
“Indeed, but I think it’s the procedures for randomization, allocation concealment, and blinding that I’m most proud of. Stunning, if I do say so myself,” added the statistician.
“It’s a perfect trial,” they simultaneously exhaled. And with that, they patted each other on their respective backs and went about their busy days. But unbeknownst to them, two angels had been eavesdropping on their conversation.
…
The first angel, young and innocent, turned to their wiser companion, “They certainly seem satisfied with themselves, but is it really perfect?”
“You’ll need to cut them some slack, my young friend. They are trying to do something truly difficult.”
“Please say more.”
“They want to estimate the treatment effect of Amazaflux for some group of patients. Have we discussed treatment effects before?”
“No, we have not.”
“Very well then. The first thing to understand is that this treatment effect must be stated in terms of one or more outcomes. In other words, they want to know the effect of Amazaflux on what?”
“Ok, that makes sense to me, please continue, wise one.”
“The next thing to understand is that treatment effects are always comparative in nature. They are the difference in outcomes observed under one condition versus those observed under another. Amazaflux versus the current standard of care in this case.”
“I see,” exclaimed the younger angel. “So all they need to do is look into a patient’s future assuming the current standard of care and see what their outcome would be, and compare that to the future outcome in the universe where they were instead treated with Amazaflux.”
The wiser angel smiled, delighted by the naivety of their companion. “That is of course what you or I might do, but remember, these are primitive creatures and can't observe the multiverse of potential futures.”
“Oh, yes, of course not. I see now. But then surely any attempt to know the treatment effect of Amazaflux is folly.”
“You are indeed right that they can never know the treatment effect, and certainly not for any specific patient. But they can still learn something very useful about it. In fact, their approach to learning about treatment effects is one of the cleverest things they’ve ever come up with.”
“Please enlighten me, good friend. I'm listening.”
The older angel smiled again, basking in the opportunity to dispense more of their hard-earned wisdom.
“The thing they might come to understand is called the sample average treatment effect. Imagine a sample of patients who are recruited into a clinical trial. Now picture measuring the outcome for each of them at some point in the future, and then calculating the mean of those outcomes across all patients. Are you with me?”
“I am. Please continue.”
“Then see yourself randomly placing each patient into one of two groups, calculating the mean outcome in each group, and comparing them. What would you expect to see? Would they be different?”
The younger angel thought for a moment.
“Well, that depends on what you mean by different. Assuming the two groups are similarly treated, I would expect the mean outcome in the two groups to be the same as the mean for the overall sample. In other words, looking across a multiverse where this procedure has been perfectly repeated an infinite number of times, I would expect the average, or expectation, of those between-group differences in means to be zero.”
“Assuming the patients were truly placed into the groups at random, yes?”
“Indeed.”
“However,” the young angel added, “for any particular realization of the process you’ve just described, I’d be very surprised to get a between-group difference in means of exactly zero.”
“And why is that?”
“Well, for example, I know that the expected outcome for rolling a pair of 6-sided dice and subtracting their results is exactly zero. Yet I wouldn’t be at all surprised if I rolled the dice and took their difference and got a result other than zero. In fact, for any single roll of the dice, I would only observe a difference of zero about 17% of the time. But if I were to replicate this procedure over and over again, recording the outcomes along the way and calculating their mean across all the replications, this value would surely approach the value of zero as the number of replications increased to infinity.”
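(For readers who would like to check the younger angel's arithmetic, here is a minimal Python sketch. It simply enumerates all 36 equally likely results of two fair dice; the variable names are illustrative and nothing here comes from the angels themselves.)

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair six-sided dice,
# recorded as (first die) minus (second die).
differences = [a - b for a, b in product(range(1, 7), repeat=2)]

# The expectation of the difference: exactly zero.
expected_difference = Fraction(sum(differences), len(differences))

# The chance that any single roll yields a difference of exactly zero.
prob_zero = Fraction(differences.count(0), len(differences))

print(expected_difference)           # 0
print(prob_zero, float(prob_zero))   # 1/6, roughly 0.167
```

So the expectation is zero, yet five rolls out of six will show some other difference.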
“Bravo. You are exactly right, my young friend. To summarize, randomization creates two groups that are comparable in expectation. Thus, if we expose the two groups to different treatments, such as Amazaflux versus the current standard of care, and we see a notable between-group difference in mean outcomes, we might feel comfortable concluding the difference in treatments explains that difference in outcomes. Further, we might use that between-group difference in mean outcomes that we observed as our estimate of the average treatment effect. That estimate will be unbiased, given the comparability of the groups that results from randomization, so that if we were to imagine steadily increasing the number of patients enrolled in the trial until there were infinitely many of them, the resulting estimates would inexorably approach the “true” treatment effect, just like your example with the dice. There would still be error in any single estimate, due to the finite nature of the sample and to measurement error, but the severity of these errors can be lessened by recruiting more patients into the study.”
“Ok, I’m glad to know we are on the same page thus far. But earlier you made the point of calling it the sample average treatment effect?”
“Indeed. Should I say more?”
“Please. I think I understand why it's an average treatment effect, but…”
The wiser angel interrupted. “Before I continue, can you explain how you can see that it's an average treatment effect please? Just to make sure we are in agreement before moving to more complicated matters.”
“I'll try. We already established that the primitives can't see the treatment effect for any particular patient, because they can't observe their outcome under each treatment condition.”
“That's correct. They refer to this as the fundamental problem of causal inference. A fundamental problem for them, that is, not for us. But please continue, young one.”
“So if I first imagine that the treatment has the same effect for all patients, and I know that randomization creates groups that are comparable by expectation, then adding that constant treatment effect to the outcome for patients who received Amazaflux would simply shift the mean of outcome values in that group by exactly that amount. So the difference in mean outcomes between the two groups would reflect that constant treatment effect.”
“Very good.”
“And even if I imagine that the treatment effect varies across patients, there would still be a mean treatment effect across all patients. And since the groups resulting from randomization would also be comparable in expectation for these individual treatment effects, the mean of the individual treatment effects would still be reflected in the between-group difference in mean outcomes.”
“Even better, my friend. And this idea of heterogeneity in treatment effects brings us right back to the sample in the sample average treatment effect,” the older angel said with visible satisfaction. “Shall I pick back up from here then?”
“Please do, wise one.”
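(An aside before the wiser angel resumes. The younger angel's argument about averages can be checked on a toy sample. The sketch below takes an angel's-eye view: it assumes we can see both potential outcomes for each of eight hypothetical patients, which no primitive ever could. All the numbers are made up for illustration. It then runs through every possible way of assigning half the patients to Amazaflux and averages the resulting between-group differences in means.)

```python
from fractions import Fraction
from itertools import combinations

# Made-up potential outcomes for 8 hypothetical patients.
# y0[i]: outcome under the standard of care; y1[i]: outcome under Amazaflux.
y0 = [10, 12, 9, 15, 11, 14, 8, 13]
effects = [2, 0, 5, -1, 3, 1, 4, 2]   # individual treatment effects vary
y1 = [base + effect for base, effect in zip(y0, effects)]

n = len(y0)


def mean(values):
    """Exact arithmetic mean, kept as a fraction to avoid rounding."""
    return Fraction(sum(values), len(values))


# The sample average treatment effect, visible only to angels.
sate = mean(effects)

# Every way of placing exactly half of the patients on Amazaflux.
estimates = []
for treated in combinations(range(n), n // 2):
    treated_set = set(treated)
    treated_mean = mean([y1[i] for i in treated_set])
    control_mean = mean([y0[i] for i in range(n) if i not in treated_set])
    estimates.append(treated_mean - control_mean)

average_estimate = mean(estimates)

print(len(estimates))                    # 70 possible randomizations
print(sate, average_estimate)            # 2 2
print(min(estimates), max(estimates))    # single trials land on either side
```

Averaged over all 70 possible randomizations, the difference in means equals the sample average treatment effect exactly, even though the individual effects differ from patient to patient. Any single randomization, which is all the primitives ever get, can miss in either direction.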
“Remember at the start I said that their goal was to understand the treatment effect of Amazaflux for some group of patients. In a perfect world, they would be able to identify any and all patients that might reasonably benefit from treatment with Amazaflux, take a random sample from that population of patients, and enroll them into the trial. But this is clearly a fantasy. First, even the notion of identifying any and all patients who might benefit from Amazaflux, now and in the future, is an impossibility. The notion that we might get a random sample from this population is another one. It could never happen for many reasons. However, the most relevant of these is that patients can only be enrolled into a clinical trial as they present for care. So the patients enrolled in any clinical trial are the ones who happened to be identified, screened, consented, and enrolled in the very specific time and place where the trial exists. They can never be a random sample from any population. Are you with me, young friend?”
“I am. Continue.”
“Next, if treatment effects do vary, they are likely to vary for a reason. For example, the effect of the treatment might be different for patients with severe illness than it is for those with more mild disease, or perhaps it varies based on the presence or absence of a particular gene variant, or maybe it varies based on what other treatments the patients are on. It could even vary in impossibly complex ways on the basis of many such factors. So if the composition of two samples differs with respect to these factors, then the respective average treatment effects will also be different. Thus the two-arm parallel randomized controlled trial the primitives have designed will indeed provide them with what they call an unbiased estimate of an average treatment effect for Amazaflux, but it’s the average treatment effect for that particular sample.”
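(A small illustration of the wiser angel's point, with made-up numbers. Suppose, purely as an assumption, that Amazaflux improves the outcome by 10 points on average in severe illness and by 2 points in mild disease. Two trials that recruit different mixes of patients are then estimating two different sample average treatment effects.)

```python
# Hypothetical average effects within two kinds of patient (made-up numbers).
effect_by_severity = {"severe": 10.0, "mild": 2.0}


def sample_average_effect(n_severe, n_mild):
    """Average treatment effect for a sample with this mix of patients."""
    total = (n_severe * effect_by_severity["severe"]
             + n_mild * effect_by_severity["mild"])
    return total / (n_severe + n_mild)


# Same drug, same per-patient effects, different samples.
trial_a = sample_average_effect(n_severe=80, n_mild=20)
trial_b = sample_average_effect(n_severe=20, n_mild=80)

print(trial_a)   # 8.4
print(trial_b)   # 3.6
```

Both trials could be perfectly randomized and perfectly unbiased, and still arrive at different answers, because each answers the question for its own sample.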
“And you think this is clever!?” the innocent angel gasped. “How can you call it clever? We’ve established that there might be heterogeneity in treatment effects, and that the patients who are enrolled in any particular trial could be different from other relevant patients, perhaps in important ways. So if the goal is to understand if any and all patients who might benefit from Amazaflux might actually benefit, this hardly seems like a clever way to go about it!”
“Everything is relative, my friend. So what’s a cleverer way?”
The innocent angel sharply drew a breath, ready to respond in force, but no words followed. Their brow furrowed and they thought as hard as they could, but no cleverer way appeared.
“I relent, wise one. Given the constraints facing the primitives, I can’t think of a better way. The only thing I can think of is to run more trials and bigger trials, so that they are able to capture as many different types of relevant patients as possible, and then perhaps they can estimate average treatment effects within different subgroups that might be of interest. But this doesn’t seem very clever and I feel like you are about to enlighten me further.”
“Ahhh, you are wise indeed, young friend. The next thing to understand is that demonstrating heterogeneity in treatment effects in a manner that can actually be useful for guiding care is much more difficult than demonstrating an average benefit of treatment. Thus our understanding of the average will always precede our understanding of the specific or the exceptional. Does that make sense?”
“It does.”
“Excellent. Next, we must also consider that our goal is to get as many patients as we can onto improved treatments as soon as possible. So if the early trials strongly suggest substantial average benefit in some group of patients, it might be ethically dubious to continue withholding that treatment for other patients that might also benefit. At this point, it becomes a matter of clinical judgment rather than scientific investigation. The crux of the problem is that one could always imagine some special type of patient for whom the treatment works differently. The possibilities are endless. But at some point a decision to make this medicine more widely available must be made. And once that decision is made, they lose the ability to continue testing the treatment effects of the new drug in randomized controlled trials because that would require withholding the treatment from some patients who might be reasonably likely to benefit.”
“Ah, yes, I see,” said the younger angel. “But does that mean they can never identify heterogeneity in treatment effects?”
“No, not at all. But the problem becomes harder since the treatment can no longer be assigned at random. This means that the primitives can no longer safely assume that the group of patients they observe receiving the treatment are comparable in expectation with patients who didn’t receive it. That’s because there are almost certainly reasons why some patients received it while others didn’t, and those reasons might also be related to outcomes. The only way to completely overcome this is to already know what these reasons are, any and all of them, and account for them in the study design or analysis of the study data. That said, such observational data can often include information on massive numbers of patients, many many more than could ever be enrolled into a randomized controlled trial, perhaps allowing for more exploration of heterogeneity in treatment effects, even in relatively tiny subgroups of patients. And smart people are always coming up with better ways to analyze observational data to learn useful things about treatment effects, despite the inherent challenges. I’ve even heard of cutting edge methods that combine RCT data, observational data, and a ‘tough nurse’ to arrive at truly personalized treatment effects, but I must admit that I don’t totally understand that one yet. But does all the rest of what I’ve said make sense?”
“Absolutely. I now see why you are so impressed with how they’ve harnessed randomness to help understand their world.” The younger but now wiser angel looked upon their companion with obvious admiration. “How do you know all this, wise one? Has it always been so?”
“Ahh, not at all, my good friend. In another life, before I came here, I was a trial statistician.”
“Oh, I see. I was just a computer scientist.”
“I know, my friend, I know. Let’s talk some more about that tough nurse, shall we?”
And the two angels walked off, arm in arm, continuing their conversation...
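(A postscript for readers who want to see the wiser angel's warning about non-random treatment assignment in numbers. The sketch below uses made-up summary data in which sicker patients are both more likely to receive the drug and more likely to do poorly. It assumes, hypothetically, that severity is the only reason treatment was given to some patients and not others, and that the true effect is +5 in every patient. Real observational data rarely come with such guarantees.)

```python
# Made-up observational data, summarized by stratum.
# Each entry: (n treated, n untreated, mean outcome treated, mean outcome untreated)
# Within every stratum, the treated do exactly 5 points better.
strata = {
    "severe": (80, 20, 45.0, 40.0),
    "mild": (20, 80, 65.0, 60.0),
}


def weighted_mean(pairs):
    """Mean of the values in (weight, value) pairs, weighted by the weights."""
    total_weight = sum(weight for weight, _ in pairs)
    return sum(weight * value for weight, value in pairs) / total_weight


# Naive comparison: everyone treated versus everyone untreated.
treated_mean = weighted_mean([(nt, mt) for nt, _, mt, _ in strata.values()])
untreated_mean = weighted_mean([(nu, mu) for _, nu, _, mu in strata.values()])
naive_difference = treated_mean - untreated_mean

# Comparing like with like: the difference within each stratum,
# weighted by the number of patients in that stratum.
adjusted_difference = weighted_mean(
    [(nt + nu, mt - mu) for nt, nu, mt, mu in strata.values()]
)

print(naive_difference)      # -7.0
print(adjusted_difference)   # 5.0
```

The naive comparison makes a beneficial drug look harmful, because the treated group is dominated by severely ill patients. Accounting for severity recovers the +5, but only because this toy example was built so that severity is the whole story.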
Angel #1: Wait, you're telling me that almost 300 years after Laplace and Bayes explained how to do statistical inference correctly, the primitives are still using frequentist methodology? A methodology that is fundamentally incapable of telling them what they actually want to know, which is how likely it is that the effect is real?
Angel #2: Yes.
Angel #1: (weeps softly)
Super story, Darren, until perhaps the very end. The need for observational data at the end could be questioned (as opposed to having the RCTs in the first place), and the bias that affects even the estimate of the sample average treatment effect in observational data will carry over to all “subgroup” effects. Also, just to make the story more complex :-) it could be mentioned that the RCT data alone can often be used to estimate patients’ treatment effects (on an absolute scale) as a function of patient characteristics, even in the absence of HTE.