19 Regression Inference Part II

1 Regression Validity

Now we’ll examine validity and sources of bias in regression models.

External Validity—that the observed effect holds true in other settings; that it does not depend on conditions left out of the model. Questions to ask: is the sample representative? Are the assumptions portable?

Internal Validity—that the observed effect is properly identified; that the assumptions of least squares regression are met.

The most important assumption for regression validity is that the errors are uncorrelated with the predictors:

\[\text{E}[{\boldsymbol \varepsilon }| {\boldsymbol X}] = 0\]

Violations of this assumption will cause LS estimates to become biased. To see this, recall that the regression function \({\boldsymbol y} = {\boldsymbol X} {\boldsymbol \beta }+ {\boldsymbol \varepsilon}\) can be expressed:

\[ \begin{aligned} \text{E}[{\boldsymbol y} | {\boldsymbol X}] &= \text{E}[{\boldsymbol X} {\boldsymbol \beta }+ {\boldsymbol \varepsilon }| {\boldsymbol X}] \\ &= {\boldsymbol X} {\boldsymbol \beta }+ \text{E}[{\boldsymbol \varepsilon }| {\boldsymbol X}] \\ &= {\boldsymbol X} {\boldsymbol \beta } \end{aligned} \]

i.e. in order for \(\text{E}[{\boldsymbol y} | {\boldsymbol X}] = {\boldsymbol X} {\boldsymbol \beta}\), it’s necessary that \(\text{E}[{\boldsymbol \varepsilon }| {\boldsymbol X}] = 0\), otherwise the estimates will be biased.
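
To see this in action, here is a minimal simulation sketch (the numbers and variable names are made up for illustration) in which the error is deliberately drawn to be correlated with the predictor, so that \(\text{E}[{\boldsymbol \varepsilon }| {\boldsymbol X}] \neq 0\):

library(MASS)

set.seed(1)
n <- 1000

# draw a predictor and an error term with correlation 0.5 (violating exogeneity)
xe <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), nrow = 2))
x  <- xe[, 1]
e  <- xe[, 2]

y <- 2 + 3 * x + e           # the true slope is 3

coef(lm(y ~ x))["x"]         # noticeably larger than 3

The estimated slope absorbs the part of the error that moves with \(x\), so it is biased away from the true value.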

Common sources of bias that cause \(\text{E}[{\boldsymbol \varepsilon }| {\boldsymbol X}] \neq 0\) are covered in the sections below: omitted variable bias, specification bias, measurement error bias, and simultaneity bias.

2 Omitted Variable Bias

The problem: correlation between omitted variable(s) and observed predictors makes LS estimates biased.

E.g. if the true model is:

\[{\boldsymbol y} = \beta_0 + \beta_1 {\boldsymbol x}_1 + \beta_2 {\boldsymbol x}_2 + {\boldsymbol \varepsilon}\]

but instead we use the following model for \(y\):

\[{\boldsymbol y} = \beta_0 + \beta_1 {\boldsymbol x}_1 + \tilde {{\boldsymbol \varepsilon}}\]

i.e. in our model we omit the variable \({\boldsymbol x}_2\). The error term in our flawed model is \(\tilde {{\boldsymbol \varepsilon}} = \beta_2 {\boldsymbol x}_2 + {\boldsymbol \varepsilon}\), i.e. it comprises the error of the true model and the effect of omitted variable(s). The LS estimator for \(\beta_1\) in our model would be:

\[{\boldsymbol b}_1 = ({\boldsymbol x}_1^T {\boldsymbol x}_1)^{-1} {\boldsymbol x}_1^T {\boldsymbol y} = \beta_1 + ({\boldsymbol x}_1^T {\boldsymbol x}_1)^{-1} {\boldsymbol x}_1^T {\boldsymbol x}_2 \beta_2 + ({\boldsymbol x}_1^T {\boldsymbol x}_1)^{-1} {\boldsymbol x}_1^T {\boldsymbol \varepsilon}\]

and

\[\text{E}[b_1 | {\boldsymbol X}] = \beta_1 + ({\boldsymbol x}_1^T {\boldsymbol x}_1)^{-1} {\boldsymbol x}_1^T {\boldsymbol x}_2 \beta_2 \]

where the second term on the right is the bias in \({\boldsymbol b}_1\) as an estimate of \(\beta_1\), caused by omitting \({\boldsymbol x}_2\). Note this term becomes zero if \({\boldsymbol x}_1\) and \({\boldsymbol x}_2\) are uncorrelated (for centered predictors this means \({\boldsymbol x}_1^T {\boldsymbol x}_2 = 0\)).
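
As a quick numerical check (a sketch with made-up coefficients, not a dataset from the text), we can simulate a two-predictor model, fit the misspecified one-predictor model, and compare the resulting estimate with the bias predicted by the formula above:

set.seed(2)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)           # x2 is correlated with x1
y  <- 2 * x1 + 3 * x2 + rnorm(n)    # true beta_1 = 2, beta_2 = 3

coef(lm(y ~ x1))["x1"]              # biased estimate of beta_1

# prediction from the formula: beta_1 + (x1' x1)^{-1} x1' x2 * beta_2
2 + 3 * sum(x1 * x2) / sum(x1 * x1)

The two numbers should roughly agree (the formula ignores the intercept, which matters little here since the simulated predictors have mean zero).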

The solution: include any omitted variables that affect the response and are correlated with the included predictors. These confounding variables should be controlled for.

Here’s a very rudimentary example: the following dataset has three variables, recording the mileage, age, and maintenance expenses of some cars. First, let’s run a simple regression predicting maintenance expenses from mileage:

lm(expenses ~ mileage, data = carexpenses)
## 
## Call:
## lm(formula = expenses ~ mileage, data = carexpenses)
## 
## Coefficients:
## (Intercept)      mileage  
##     1984.52       -21.04

Strangely, the coefficient on mileage is negative, suggesting that cars with higher mileage somehow have lower maintenance expenses.

But when we add age as a predictor, watch what happens:

lm(expenses ~ mileage + age, data = carexpenses)
## 
## Call:
## lm(formula = expenses ~ mileage + age, data = carexpenses)
## 
## Coefficients:
## (Intercept)      mileage          age  
##      631.05        17.68       127.58

i.e. after controlling for age, the coefficients show more reasonable results—that maintenance expenses increase with mileage and age.

This puzzling discrepancy can be resolved by looking at the relationship between mileage and age:
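
For example, a scatterplot of the two predictors (a sketch assuming the same carexpenses data frame used above) makes the pattern visible:

library(ggplot2)

# how are mileage and age related in this dataset?
ggplot(carexpenses, aes(x = age, y = mileage)) +
  geom_point(alpha = .5)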

Clearly, in this dataset, the newer cars have higher mileage than the older ones. Strange though this may be, this missing piece of information explains the aberrant relationship between maintenance expenses and mileage observed in the simple regression: we had omitted an important confounding variable, age, which, it turns out, changes the interpretation of the problem completely.

You can see how omitting variables that are correlated with the predictors can result in a loss of information relevant to the problem. This can produce strongly biased regression coefficients and severe misunderstandings about the nature of the relationship between two variables.

Omitted variables like these are also known as confounding variables: they confound (distort) the apparent relationship between the predictors and the response, often unbeknownst to the experimenter.

This is another example of the danger of assuming causality. It’s important to remember that in general, nonzero regression coefficients only demonstrate association, not causation, unless all sources of confounding variables have been controlled for (which in many problems is very hard to do). For more on causal inference and various paradoxes that can arise from wrongly assuming causality, see chapter 20.

3 Specification Bias

The problem: there are nonlinear relationships (i.e. the functional form of the model is misspecified; violation of the linearity assumption).

The linearity assumption requires that the response is a linear function of the model parameters, i.e. each predictor enters the model multiplied by its coefficient. Sometimes nonlinear relationships can be transformed to become linear, e.g. by taking logs of variables.

Solutions: transform the variables so that the relationship becomes linear in the parameters (e.g. take logs of skewed variables), or otherwise respecify the functional form of the model.
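
For instance (a generic textbook case, not tied to the dataset below), a power-law relationship becomes linear in the parameters after taking logs of both sides:

\[y = \alpha x^{\beta} \varepsilon \;\;\; \Longrightarrow \;\;\; \log y = \log \alpha + \beta \log x + \log \varepsilon\]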

E.g. below is a scatterplot of life expectancy vs GDP per capita from the gapminder dataset:

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = .5) 

The relationship doesn’t look very linear.

However by log-transforming the predictor and outcome variables we can make the relationship more linear. Here’s a scatterplot of the log-transformed variables:

ggplot(gapminder, aes(x = log(gdpPercap), y = log(lifeExp))) +
  geom_point(alpha = .5)  

The log-transformed relationship between life expectancy and GDP per capita is clearly much more linear. For linear regression it is thus more appropriate to use the log-transformed variables, as they will give a better linear fit. You can see this for yourself by comparing the \(R^2\) between the log-transformed regression and the base model.
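
A sketch of that comparison, using the lifeExp and gdpPercap columns of the gapminder data frame:

# R^2 of the untransformed model vs the log-log model
summary(lm(lifeExp ~ gdpPercap, data = gapminder))$r.squared
summary(lm(log(lifeExp) ~ log(gdpPercap), data = gapminder))$r.squared

The log-log model should show the noticeably higher \(R^2\).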

In general, variables involving money (e.g. GDP) tend to be heavily right-skewed (due to extreme inequality their distributions often have long right tails), so log-transformed versions are often used.

4 Measurement Error Bias

The problem: variables are measured with noise—this dampens LS estimates.

E.g.

\[\text{truth:} \;\;\; {\boldsymbol y} = \beta_0 + \beta_1 {\boldsymbol x} + {\boldsymbol \varepsilon}\] \[\text{data:} \;\;\; {\boldsymbol y} = \beta_0 + \beta_1 {\boldsymbol x}^* + {\boldsymbol \varepsilon}^*\]

where \({\boldsymbol x}^*\) represents a noisy measurement of \({\boldsymbol x}\). Noise could arise because of recording errors in data entry, rounding errors, etc.

Below: “true” data is on the left, “noisy” data is on the right—note how noise dampens the estimated slope.
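
A minimal simulation sketch of this effect (made-up numbers; the measurement noise is given the same variance as the predictor, so the attenuation factor works out to about 0.5):

set.seed(3)
n      <- 1000
x      <- rnorm(n, sd = 2)          # "true" predictor, sigma_x = 2
y      <- 1 + 2 * x + rnorm(n)      # true slope is 2
x_star <- x + rnorm(n, sd = 2)      # noisy measurement, sigma_u = 2

coef(lm(y ~ x))["x"]                # close to 2
coef(lm(y ~ x_star))["x_star"]      # attenuated: roughly 2 * 4 / (4 + 4) = 1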

Proof:

Measurement error can be expressed as follows:

\[x_i^* = x_i + u_i\]

where \(u_i\) (noise) is independent of \(x_i\) and \(\varepsilon_i\) (since it’s just noise).

\[ \begin{aligned} y_i &= \beta_0 + \beta_1 x_i + \varepsilon_i \\ &= \beta_0 + \beta_1 (x_i^* - u_i) + \varepsilon_i \\ &= \beta_0 + \beta_1 x_i^* + v_i \end{aligned} \]

where \(v_i = \varepsilon_i - \beta_1 u_i\) is the modified error term. Note that \(v_i\) is correlated with \(x_i^*\) (both contain \(u_i\)), so unless \(\beta_1 = 0\) the exogeneity assumption is violated. Under this model the LS estimate for \(\beta_1\) is:

\[ \begin{aligned} b_1 &= \frac{\text{Cov}[x^*, y]}{\text{Var}[x^*]} \\ &= \frac{\text{Cov}[x + u \; , \; \beta_0 + \beta_1 x + \varepsilon]}{\text{Var}[x + u]} \\ &= \frac{\text{Cov}[x \; , \; \beta_0 + \beta_1 x + \varepsilon] + \text{Cov}[u \; , \; \beta_0 + \beta_1 x + \varepsilon]}{\text{Var}[x + u]} \\ &= \frac{\beta_1 \text{Cov}[x,x] + 0}{\text{Var}[x+u]} \\ &= \beta_1 \cdot \frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2} \\ &= \beta_1 \cdot \text{attenuation} \end{aligned} \]

i.e. the LS estimate converges to the true coefficient multiplied by an attenuation term (which is between 0 and 1). Hence—noise dampens the LS estimate (but preserves its sign).

Note:

\[\frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2} = \frac{1}{1+\sigma_u^2 / \sigma_x^2}\]

The ratio \(\frac{\sigma_x^2}{\sigma_u^2}\) is called the signal-to-noise ratio.

The larger the signal/noise ratio, the smaller the bias.
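
For instance, if the noise variance is a quarter of the signal variance, the attenuation factor is

\[\frac{1}{1 + \sigma_u^2 / \sigma_x^2} = \frac{1}{1 + 1/4} = 0.8\]

i.e. the slope is only dampened by 20%.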

Solution: use a measurement error model.
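
One very simple version of this, continuing the simulation sketch above and assuming the noise variance \(\sigma_u^2\) is known (e.g. from repeated measurements), is to divide the naive LS slope by an estimate of the attenuation factor:

sigma_u2    <- 4                                 # assumed known noise variance
b_naive     <- coef(lm(y ~ x_star))["x_star"]
attenuation <- (var(x_star) - sigma_u2) / var(x_star)

b_naive / attenuation                            # approximately recovers the true slope of 2

More general measurement error models build on the same idea.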

5 Simultaneity Bias

The problem: \(x\) and \(y\) cause each other. This creates a simultaneous equations model:

\[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\] \[x_i = \gamma_0 + \gamma_1 y_i + \eta_i\]

The LS estimator becomes a weighted average of two components: the coefficient \(\beta_1\) from the first equation and the reciprocal \(1/\gamma_1\) of the coefficient from the second. Because \(x\) depends on \(y\), and hence on \(\varepsilon\), the error is correlated with the predictor and the LS estimate of \(\beta_1\) is biased.
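
A sketch of the standard result (assuming \(\varepsilon\) and \(\eta\) are uncorrelated, with variances \(\sigma_\varepsilon^2\) and \(\sigma_\eta^2\)): solving the two equations for their reduced forms and taking the limit of the LS estimator gives

\[\text{plim} \; b_1 = \frac{\beta_1 \sigma_\eta^2 + \gamma_1 \sigma_\varepsilon^2}{\sigma_\eta^2 + \gamma_1^2 \sigma_\varepsilon^2} = \frac{\sigma_\eta^2}{\sigma_\eta^2 + \gamma_1^2 \sigma_\varepsilon^2} \cdot \beta_1 + \frac{\gamma_1^2 \sigma_\varepsilon^2}{\sigma_\eta^2 + \gamma_1^2 \sigma_\varepsilon^2} \cdot \frac{1}{\gamma_1}\]

i.e. a mix of \(\beta_1\) and \(1/\gamma_1\), with weights determined by the relative error variances.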

This is known as simultaneity bias.