Suppose \(X\) is a binary treatment where \(X_i=1\) means subject \(i\) was “treated” and \(X_i=0\) means subject \(i\) was “not treated”. Treatment refers to any kind of stimulus (e.g. a medication) that some members of a sample are subjected to. Let \(Y_i\) be an outcome variable, which is some measurable post-treatment quantity observed for all subjects in the sample (e.g. presence/absence of ailment). We’re interested in finding whether \(X\) has a causal effect on \(Y\).
Now let \(Y_{i1}\) be the outcome if subject \(i\) received the treatment, and \(Y_{i0}\) be the outcome if subject \(i\) didn’t receive the treatment. We can call \(\beta_i = Y_{i1}-Y_{i0}\) the treatment effect.
Immediately there are two problems:

1. For any given subject we only ever observe one of \(Y_{i1}\) or \(Y_{i0}\) (whichever corresponds to the treatment they actually received), so the individual treatment effect \(\beta_i\) can never be computed directly.
2. The treatment effect \(\beta_i\) may well differ from subject to subject, so there is no single number describing the effect of the treatment.

This pushes us towards working with averages.
We can define the average treatment effect (ATE):
\[\text{ATE} = \text{E}[\beta_i] = \text{E}[Y_{i1} - Y_{i0}]\]
The ATE is the average difference the treatment makes in the outcome variable, averaged over all the subjects in the sample. Note we can never compute this directly from the data, since for each subject only one of the two potential outcomes is observed. What we can compute from the data is the difference in mean outcomes between the treated and control groups, which we'll call the observed ATE.
We can also define the average treatment effect on the treated (ATT):
\[ \begin{aligned} \text{ATT} &= \text{E}[\beta_i | X_i = 1] \\ &= \text{E}[Y_{i1}-Y_{i0} | X_i = 1] \\ &= \text{E}[Y_{i1} | X_i = 1] - \text{E}[Y_{i0} | X_i = 1] \end{aligned} \]
Note the second term in the above expression is not observed in the study.
Also define the average treatment effect on the control (ATC):
\[ \begin{aligned} \text{ATC} &= \text{E}[\beta_i | X_i = 0] \\ &= \text{E}[Y_{i1}-Y_{i0} | X_i = 0] \\ &= \text{E}[Y_{i1} | X_i = 0] - \text{E}[Y_{i0} | X_i = 0] \end{aligned} \]
where the first term in the above expression is not observed.
Note how the difference we do observe in the study (the observed ATE) can be expressed:
\[\text{observed ATE} = \text{E}[Y_{i1} | X_i = 1] - \text{E}[Y_{i0} | X_i = 0]\]
Adding and subtracting \(\text{E}[Y_{i0} | X_i = 1]\), this decomposes as:
\[= \bigg\{ \text{E}[Y_{i1} | X_i = 1] - \text{E}[Y_{i0} | X_i = 1] \bigg\} + \bigg\{ \text{E}[Y_{i0} | X_i = 1] - \text{E}[Y_{i0} | X_i = 0] \bigg\}\]
The first term in the curly brackets is the ATT: this is the quantity of interest, as it measures how much the treatment affected the outcome for the subjects who actually received it.
The second term gives how much the treatment and control groups differ, even in the absence of treatment. This is known as selection bias.
The observed average treatment effect can be summarized as follows:
\[\text{observed ATE} = \text{ATT} + \text{selection bias}\]
i.e. the effect we observe (the observed ATE) is equal to the quantity we want (the ATT) plus another term representing selection bias in the sample. The ideal study would have no selection bias, meaning the observed ATE is a valid measure of how much, on average, the treatment affects a subject. But if there is selection bias, the observed effect will be misleading: it may overstate or understate the true effect.
The solution: randomization.
If the treatment is assigned randomly, then \(Y_{i0}\) and \(X_i\) will be independent:
\[\text{E}[Y_{i0} | X_i = 1] = \text{E}[Y_{i0} | X_i = 0]\]
This is because if \(Y_{i0}\) and \(X_i\) are independent, then \(\text{E}[Y_{i0} | X_i] = \text{E}[Y_{i0}]\). If this condition is met, the selection bias term vanishes.
Thus randomization eliminates selection bias and ensures that the observed ATE equals the ATT (and, since randomization makes \(Y_{i1}\) independent of \(X_i\) as well, the ATT in turn equals the ATE).
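To make this concrete, here is a minimal simulation sketch in Python (NumPy only; the data-generating process, effect size, and selection mechanism are invented purely for illustration). When healthier subjects are more likely to be treated, the naive treated-vs-control comparison is contaminated by selection bias; under randomized assignment it recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Potential outcomes with a true treatment effect of +2 for every subject.
# (All values here are invented purely for illustration.)
baseline = rng.normal(0, 1, n)          # unobserved "health" of each subject
y0 = baseline + rng.normal(0, 1, n)     # outcome if not treated
y1 = y0 + 2.0                           # outcome if treated (true effect = 2)

def observed_ate(x):
    """Naive comparison: mean outcome of the treated minus mean outcome of the controls."""
    y = np.where(x == 1, y1, y0)        # we only ever see one potential outcome per subject
    return y[x == 1].mean() - y[x == 0].mean()

# Non-random (self-selected) assignment: healthier subjects are more likely to be treated.
p_treat = 1 / (1 + np.exp(-2 * baseline))
x_selected = rng.binomial(1, p_treat)

# Randomized assignment: treatment is independent of the potential outcomes.
x_random = rng.binomial(1, 0.5, n)

print("true ATE:                        2.00")
print(f"observed ATE (self-selected): {observed_ate(x_selected):7.2f}")   # inflated by selection bias
print(f"observed ATE (randomized):    {observed_ate(x_random):7.2f}")     # close to 2
```

In the self-selected scenario the naive difference overstates the effect because the treated group was healthier to begin with; randomization removes that imbalance on average.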
Ecological correlation refers to correlations observed at the group level rather than the individual level.
E.g. suppose we are examining the relationship between parental funding and student academic performance. A correlation observed at the individual level would use observations on individual students, comparing their individual values for funding and academic performance. A correlation observed at the group level would instead use average values for funding and academic performance, aggregated across certain groups in the data, e.g. gender or ethnicity.
It turns out that correlations observed at the group level can be vastly different from correlations observed at the individual level, even when they are computed from the same data. It is a common mistake to assume that a correlation observed at one level of aggregation will hold at another; this is known as the ecological fallacy.
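A small sketch of how this can happen (Python with NumPy; the three groups and all numbers are made up for illustration): within each group the two variables are negatively correlated, yet the correlation computed from the group means is strongly positive.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three hypothetical groups; their means rise together, but within each
# group x and y are negatively related. (All numbers invented for illustration.)
groups = []
for mx, my in [(0, 0), (5, 5), (10, 10)]:
    gx = rng.normal(mx, 1.0, 500)
    gy = my - 0.8 * (gx - mx) + rng.normal(0, 0.5, 500)
    groups.append((gx, gy))

# Individual-level correlations, computed within each group: all negative.
print([round(np.corrcoef(gx, gy)[0, 1], 2) for gx, gy in groups])

# Group-level (ecological) correlation, computed from the three group means:
# close to +1, i.e. the opposite sign.
mean_x = [gx.mean() for gx, _ in groups]
mean_y = [gy.mean() for _, gy in groups]
print(round(np.corrcoef(mean_x, mean_y)[0, 1], 2))
```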
Simpson’s paradox is an insightful example of ecological fallacy.
Simpson’s paradox occurs when a trend is observable within certain groups of data, but vanishes or changes direction when the data is looked at as a whole (when the groups are combined).
A classic example: in 1973 the UC Berkeley Graduate Division admitted 44% of its male applicants and 35% of its female applicants, prompting controversy over the apparent gender bias against women. But when the admission rates of individual departments were examined, it was found that six departments were biased against men, and only four were biased against women. The paradox was later explained by the finding that women tended to apply in larger numbers to the more competitive departments (where applicants of both genders were admitted at low rates).
Below is a mathematical treatment of the problem:
Suppose \(X\) is a binary treatment, \(Y\) is a binary outcome, and \(Z\) is some categorical variable like gender. Suppose the joint distribution of \(X\), \(Y\), and \(Z\) is:
|  | \(Y=1,\ Z=1\) | \(Y=0,\ Z=1\) | \(Y=1,\ Z=0\) | \(Y=0,\ Z=0\) |
|---|---|---|---|---|
| \(X=1\) | 0.1500 | 0.2250 | 0.1000 | 0.0250 |
| \(X=0\) | 0.0375 | 0.0875 | 0.2625 | 0.1125 |
The marginal distribution for \(X,Y\) (i.e. looking at the distribution as a whole) is:
|  | \(Y=1\) | \(Y=0\) |
|---|---|---|
| \(X=1\) | 0.25 | 0.25 |
| \(X=0\) | 0.30 | 0.20 |
From the second table (the combined data), we have the following:
\[\text{P}(Y=1|X=1) < \text{P}(Y=1|X=0)\]
which seems to say the treatment is harmful overall.
Yet, when taking the group variable (gender) into account we have:
\[\text{P}(Y=1|X=1,Z=1) > \text{P}(Y=1|X=0,Z=1)\]
\[\text{P}(Y=1|X=1,Z=0) > \text{P}(Y=1|X=0,Z=0)\]
which seem to say the treatment is beneficial to women (\(Z=1\)) and to men (\(Z=0\)).
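These statements can be checked directly from the joint distribution table above; a quick sketch in Python (NumPy only), using the probabilities exactly as given:

```python
import numpy as np

# Joint distribution P(X, Y, Z) from the table above.
# Index order: p[x, z, y], where index 0 means the value 1 and index 1 means the value 0,
# so p[0, 0, 0] = P(X=1, Z=1, Y=1) = 0.15, etc.
p = np.array([
    [[0.1500, 0.2250], [0.1000, 0.0250]],   # X=1: (Z=1: Y=1, Y=0), (Z=0: Y=1, Y=0)
    [[0.0375, 0.0875], [0.2625, 0.1125]],   # X=0
])
assert np.isclose(p.sum(), 1.0)

# Marginal over Z -- the "combined" table.
p_xy = p.sum(axis=1)
print(p_xy)                          # [[0.25 0.25] [0.30 0.20]]
print(p_xy[0, 0] / p_xy[0].sum())    # P(Y=1 | X=1)      = 0.5
print(p_xy[1, 0] / p_xy[1].sum())    # P(Y=1 | X=0)      = 0.6  -> treatment looks harmful overall

# Conditioning on Z reverses the comparison in both groups.
print(p[0, 0, 0] / p[0, 0].sum())    # P(Y=1 | X=1, Z=1) = 0.4
print(p[1, 0, 0] / p[1, 0].sum())    # P(Y=1 | X=0, Z=1) = 0.3  -> looks beneficial for Z=1
print(p[0, 1, 0] / p[0, 1].sum())    # P(Y=1 | X=1, Z=0) = 0.8
print(p[1, 1, 0] / p[1, 1].sum())    # P(Y=1 | X=0, Z=0) = 0.7  -> looks beneficial for Z=0
```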
This is a clear example of Simpson's paradox: these three mathematical statements seem to imply contradictory things about the nature of the treatment effect.
The reality is that these mathematical statements are not actually contradictory at all—what is wrong is our interpretation. We have assumed causality without first proving it.
To see this, let’s define association:
\[\alpha = \text{E}[Y|X=1] - \text{E}[Y|X=0]\]
i.e. this is the association between \(X\) and \(Y\). However, this is not the causal effect of \(X\) on \(Y\), which from section 20–1 is the average treatment effect \(\beta = \text{E}[Y_{i1} - Y_{i0}]\). The observed association instead decomposes as:
\[\alpha = \bigg\{ \text{E}[Y_{i1} | X_i = 1] - \text{E}[Y_{i0} | X_i = 1] \bigg\} + \bigg\{ \text{E}[Y_{i0} | X_i = 1] - \text{E}[Y_{i0} | X_i = 0] \bigg\}\]
\[= \text{ATT} + \text{selection bias}\]
i.e. the observed association only demonstrates causality if there is no selection bias. Selection bias can arise from any number of confounding variables that have not been accounted for, but which carry information relevant to the outcome.
The three mathematical statements from above are in fact not paradoxical at all—the paradox arises only in our assumption that \(\text{P}(Y=1|X=1) < \text{P}(Y=1|X=0)\) means the treatment is harmful overall, which it does not. It only describes an association, not a causal effect. Similarly, the statement \(\text{P}(Y=1|X=1,Z=1) > \text{P}(Y=1|X=0,Z=1)\) does not mean the treatment is beneficial to women—it describes an association only.
The takeaway:
\[\bf{\text{Association is not causation.}}\]
Is it ever possible to estimate the causal effect? The answer is sometimes. Randomized assignment of subjects to treatment will allow us to estimate the causal effect, because it eliminates selection bias. But without randomization, there may be any number of confounders that change the interpretation of the problem.