Linear Regression Part II: Theory

1 Regression Coefficients as Estimates

In module 1 we introduced linear regression as a technique for modeling linearly related variables. We now have the tools to consider regression models with a little more rigour.

Recall the general form of a regression model:

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k + \varepsilon\]

where \(y\) is the outcome variable, \(x_1, ..., x_k\) are a set of \(k\) predictor variables, \(\beta_1, ..., \beta_k\) are their respective coefficients (\(\beta_0\) is the \(y\)-intercept), and \(\varepsilon\) represents the error term of the model.

Recall that in a regression model, the coefficient \(\beta\) on a particular variable describes the effect of that variable on the response variable, while holding the other predictors constant (i.e. while controlling for the effects of the other predictors).

The equation specified above is known as the population regression equation, and the symbol \(\beta\) denotes the population regression coefficient (i.e. the coefficient we would get if we ran the regression on population data). Of course, for a given sample of data, the regression coefficients we compute are only estimates for the true coefficients, and like all other estimates they are subject to the whims of variability. We typically use a different symbol, \(b\), to represent the estimated regression equation:

\[y = b_0 + b_1x_1 + b_2 x_2 + ... + b_k x_k + e\]

where \(b_1, ..., b_k\) are the set of estimated regression coefficients, and \(e\) represents the residuals (the estimated errors) of the model. This recasts regression as an estimation problem: our goal is to use sample data to estimate the true regression coefficients.

In the terminology of expectation, the systematic part of the regression equation is the conditional expectation of \(y\) given \(x\): \(\text{E}[y | x] = \beta_0 + \beta_1 x\). In the case of multiple predictors, we can write \(\text{E}[y | x_1, ..., x_k] = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k\).

2 The Method of Least Squares

The technique most commonly used to fit the regression line is known as the method of least squares. This technique involves finding the coefficients \(b\) that minimize the model’s sum of squared residuals (SSR).

In a simple (bivariate) regression, it turns out the value for \(b_1\) that minimizes the sum of squared errors is given by:

\[b_1 = r_{xy}\frac{s_y}{s_x}\]

where \(r_{xy}\) is the sample correlation coefficient between \(x\) and \(y\), and \(s_y\) and \(s_x\) are the sample standard deviations of \(y\) and \(x\). The proof for this is below.

2.1 Bivariate Derivation

Given sample data on two variables, \(y\) and \(x\), the regression equation predicting \(y\) given \(x\) is:

\[y_i = b_0 + b_1 x_i + e_i\]

Our goal is to find \(b_0\) and \(b_1\) such that the sum of squared residuals (SSR) of this model is minimized—see the figure on the right.

To do this, first rearrange the regression equation to get the residuals as a function of \(b_0\) and \(b_1\):

\[e_i = e_i(b_0,b_1) = y_i - b_0 - b_1 x_i\]

The squared residuals are simply \((e_i)^2 = (y_i - b_0 - b_1 x_i)^2\), and the sum of squared residuals is:

\[\text{SSR}(b_0,b_1) = \sum_i^n (y_i - b_0 - b_1 x_i)^2\]

To find \(b_0\) for which \(\text{SSR}\) is minimum, set the partial derivative of \(\text{SSR}\) with respect to \(b_0\) equal to zero:

\[\frac{\partial{\text{SSR}}}{\partial b_0} = -2 \sum_i^n(y_i - b_0 - b_1 x_i) = 0\]

Rearrange, and use the fact that \(\frac 1n \sum_i^n y_i = \bar y\) (and likewise \(\frac 1n \sum_i^n x_i = \bar x\)), to solve for \(b_0\):

\[ \begin{aligned} \sum_i^n b_0 &= \sum_i^n y_i - \sum_i^n b_1 x_i \\ b_0 &= \frac 1n \sum_i^n y_i - \frac 1n \sum_i^n b_1 x_i \\ b_0 &= \bar y - b_1 \bar x \end{aligned} \]

Next, to find \(b_1\) for which \(\text{SSR}\) is minimum, set the partial derivative of \(\text{SSR}\) with respect to \(b_1\) equal to zero:

\[\frac{\partial{\text{SSR}}}{\partial b_1} = -2 \sum_i^n x_i (y_i - b_0 - b_1 x_i) = 0\]

Rearrange to solve for \(b_1\):

\[ \begin{aligned} \sum_i^n b_1 x_i^2 &= \sum_i^n x_i y_i - \sum_i^n x_i b_0 \\ &= \sum_i^n x_i y_i - \sum_i^n x_i (\bar y - b_1 \bar x) \\ &= \sum_i^n x_i y_i - \sum_i^n x_i \bar y + \sum_i^n x_i b_1 \bar x \\ \end{aligned} \]

\[ \begin{aligned} b_1 \sum_i^n x_i^2 &= \sum_i^n x_i y_i - n \bar x \bar y + b_1 n \bar x^2 \\ b_1 \bigg( \sum_i^n x_i^2 - n \bar x^2 \bigg) &= \sum_i^n x_i y_i - n \bar x \bar y \\ b_1 &= \frac{\sum_i^n x_i y_i - n \bar x \bar y}{\sum_i^n x_i^2 - n \bar x^2} = \frac{\sum_i^n (x_i - \bar x)(y_i - \bar y)}{\sum_i^n (x_i - \bar x)^2} = \frac{s_{xy}}{s_x^2} = r_{xy} \frac{s_y}{s_x} \end{aligned} \]

where we have used the facts that \(\sum_i^n x_i y_i - n \bar x \bar y = \sum_i^n (x_i - \bar x)(y_i - \bar y)\) and \(\sum_i^n x_i^2 - n \bar x^2 = \sum_i^n (x_i - \bar x)^2\). Note \(s_{xy}\) denotes the sample covariance between \(x\) and \(y\).

Thus, in a bivariate regression of \(y\) on \(x\), the coefficients that uniquely minimize the model’s sum of squared residuals are given by:

\[b_1 = r_{xy} \frac{s_y}{s_x}\]

\[b_0 = \bar y - b_1 \bar x\]
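As a quick sanity check, here is a minimal sketch in R using simulated data (the variable names and true coefficients below are illustrative) confirming that these formulas reproduce the coefficients returned by lm():

```r
# A minimal sketch with simulated data: check that b1 = r_xy * s_y / s_x and
# b0 = ybar - b1 * xbar match the coefficients that lm() computes.
set.seed(42)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

b1 <- cor(x, y) * sd(y) / sd(x)
b0 <- mean(y) - b1 * mean(x)

rbind(formulas = c(b0, b1), lm = coef(lm(y ~ x)))
```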

2.2 Multivariate Derivation

Given multivariate sample data, the regression equation predicting \(y\) given the set of \(k\) predictor variables \(x_1, ..., x_k\) is:

\[y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + ... + b_k x_{ik} + e_i\]

Or, in matrix notation:

\[{\boldsymbol y} = {\boldsymbol X} {\boldsymbol b} + {\boldsymbol e}\]

\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & ... & x_{1k} \\ 1 & x_{21} & ... & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & ... & x_{nk} \end{bmatrix} \cdot \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix} + \begin{bmatrix} e_{1} \\ e_{2} \\ \vdots \\ e_{n} \end{bmatrix} \]

Our goal is to find the coefficients \(b_0, ..., b_k\) such that the SSR of the model is minimized. Note when there are two or more predictors in a regression model, the equation no longer describes a line, but rather a plane—the figure on the right shows an example of a regression model with two predictors, \(x_1\) and \(x_2\), and the fitted regression plane and its residuals. If the model has more than two predictors, we can’t use this 3D visualization directly, but the idea is the same.

Below is the multivariate derivation of least squares regression coefficients—it involves a little linear algebra:

\[{\boldsymbol e} ({\boldsymbol b}) = {\boldsymbol y} - {\boldsymbol X} {\boldsymbol b}\]

\[\text{SSR}({\boldsymbol b}) = {\boldsymbol e}^T {\boldsymbol e} = ({\boldsymbol y} - {\boldsymbol X} {\boldsymbol b})^T ({\boldsymbol y} - {\boldsymbol X} {\boldsymbol b}) = {\boldsymbol y}^T {\boldsymbol y} - 2 {\boldsymbol y}^T {\boldsymbol X} {\boldsymbol b} + {\boldsymbol b}^T {\boldsymbol X}^T {\boldsymbol X} {\boldsymbol b}\]

Setting the derivative of \(\text{SSR}({\boldsymbol b})\) with respect to \({\boldsymbol b}\) equal to zero:

\[\frac{\partial \text{SSR}({\boldsymbol b})}{\partial {\boldsymbol b}} = -2 {\boldsymbol X}^T {\boldsymbol y} + 2 {\boldsymbol X}^T {\boldsymbol X} {\boldsymbol b} = 0\]

Thus \({\boldsymbol b}\) must satisfy \({\boldsymbol X}^T {\boldsymbol X} {\boldsymbol b} = {\boldsymbol X}^T {\boldsymbol y}\), and:

\[{\boldsymbol b} = ({\boldsymbol X}^T {\boldsymbol X})^{-1} {\boldsymbol X}^T {\boldsymbol y}\]
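As a sketch (again with simulated data and illustrative names), the normal equations can be solved directly and compared with lm(), which arrives at the same coefficients via a QR decomposition rather than an explicit inverse:

```r
# A minimal sketch with simulated data: solve the normal equations
# (X'X) b = X'y directly and compare with lm().
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)               # design matrix with an intercept column
b <- solve(t(X) %*% X, t(X) %*% y)  # solves (X'X) b = X'y

cbind(normal_eqs = drop(b), lm = coef(lm(y ~ x1 + x2)))
```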

The Assumptions of LS Regression

Linear regression models have a set of assumptions that are required for the method of least squares to work (i.e. in order for LS to deliver unbiased estimates of the regression coefficients). These assumptions concern how the data is produced (the data generating process).

2.3 Linearity

The relationship between \(y\) and each of the \(k\) predictors must be linear. Sometimes it’s possible to transform inherently nonlinear relationships to linear ones using a simple mathematical transformation. Often you’ll see models that contain log-transformed variables or powers of variables, e.g.

\[y = \beta_0 + \beta_1 \log x_1 + \beta_2 \log x_2 + ...\]

This is useful in cases where the relationship between \(y\) and \(x_1\) is not linear, but the relation between \(y\) and \(\log x_1\) is.

The takeaway—the regression equation must be linear in parameters. But it doesn’t necessarily have to be linear in the variables themselves.
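Below is a minimal sketch in R with simulated data (names and values are illustrative) of a model that is linear in the parameters but not in the original variable:

```r
# A minimal sketch with simulated data: y is linear in log(x), not in x,
# so the transformed predictor goes into the model formula.
set.seed(7)
x <- runif(200, min = 1, max = 100)
y <- 5 + 2 * log(x) + rnorm(200, sd = 0.5)

fit <- lm(y ~ log(x))  # linear in the parameters, nonlinear in x
coef(fit)              # intercept near 5, slope on log(x) near 2
```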

2.4 Uncorrelated, Mean-Zero Errors

The expected value of the error terms must be zero, conditional on the predictor(s):

\[\text{E}[\varepsilon | {\boldsymbol X}] = 0\]

Note that \(\text{E}[\varepsilon] = 0\) does not imply \(\text{E}[\varepsilon | {\boldsymbol X}] = 0\)—the latter is a stronger condition. Below is an example of a situation where \(\text{E}[\varepsilon]=0\) (the errors technically have a mean of zero) but \(\text{E}[\varepsilon | {\boldsymbol X}] \neq 0\):

Here it’s clear that changing \(X\) changes the mean of the error terms. LS regression requires that the conditional mean of the errors does not depend on \(X\): knowing the value of \(X\) should tell you nothing about the expected value of \(\varepsilon\).
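One way this can arise is an omitted nonlinearity. Here is a minimal sketch with simulated data (not the situation plotted above, just an illustration): the true relationship is quadratic, so a straight-line fit has residuals that average to zero overall, yet their mean clearly depends on \(x\):

```r
# A minimal sketch with simulated data: the true relationship is quadratic,
# so a straight-line fit has residuals whose overall mean is ~0, but whose
# mean varies systematically with x.
set.seed(2)
x <- runif(300, -2, 2)
y <- x^2 + rnorm(300, sd = 0.2)

res <- resid(lm(y ~ x))
mean(res)                     # essentially zero overall
tapply(res, cut(x, 4), mean)  # but clearly nonzero within slices of x
```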

There are two corollaries to this assumption:

1. The errors are uncorrelated with the regressor(s):

\[\text{Cov}[\varepsilon, {\boldsymbol X}] = 0\]

The above plot shows an example of error terms that are correlated with the regressor, since the mean of \(\varepsilon\) is clearly dependent on the value of \(X\).

2. The errors are uncorrelated with each other:

\[\text{Cov}[\varepsilon_i, \varepsilon_j] = 0 \quad \text{for } i \neq j\]

When errors are correlated with each other, it’s known as autocorrelation: the value of the error at one point depends on the value of the error at previous points. You can often spot autocorrelation in a scatterplot of the residuals: are the errors repeatedly positive, then repeatedly negative, and so on? Below is an example of autocorrelation:

You usually get autocorrelation in time series data—and there are various methods for dealing with it. We will discuss these later.
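Here is a minimal sketch with simulated data (the AR(1) errors below are just one illustrative way to produce autocorrelation) showing how autocorrelated residuals show up numerically:

```r
# A minimal sketch with simulated data: generate AR(1) errors and check the
# lag-1 correlation of the residuals, a simple indicator of autocorrelation.
set.seed(3)
n <- 200
x <- rnorm(n)
e <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))  # autocorrelated errors
y <- 1 + 2 * x + e

res <- resid(lm(y ~ x))
cor(res[-1], res[-n])  # clearly positive; near zero for independent errors
```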

2.5 Constant-Variance Errors

The variance of the error terms must be constant, conditional on the predictor(s):

\[\text{Var}[\varepsilon_i | {\boldsymbol X}] = \sigma^2\]

i.e. the spread of the error terms must be constant over the whole range of \(X\). Errors with constant variance are known as homoscedastic errors. Together with the previous assumption that the errors are uncorrelated, \(\text{Cov}[\varepsilon_i, \varepsilon_j | {\boldsymbol X}] = 0\) for \(i \neq j\), this says the conditional covariance matrix of the errors is simply \(\sigma^2\) times the identity matrix.

Below is an example of a model where the variance of the errors increases with \(X\):

This is an example of heteroscedasticity (errors with non-constant variance), which violates the linear regression assumptions.
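A minimal sketch with simulated data (names and values are illustrative): when the error spread grows with \(x\), a residuals-versus-fitted plot fans out into the characteristic funnel shape:

```r
# A minimal sketch with simulated data: the error standard deviation grows
# with x, so the residuals fan out as the fitted values increase.
set.seed(5)
x <- runif(200, 1, 10)
y <- 1 + 2 * x + rnorm(200, sd = 0.5 * x)  # error spread increases with x

fit <- lm(y ~ x)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")  # funnel shape
```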

2.6 No Multicollinearity

Multicollinearity is when two or more predictors are highly correlated with each other. When predictors are strongly correlated, it becomes difficult for LS regression to separate their individual effects, and the coefficient estimates become unstable. [The technical reason is that inverting the matrix \({\boldsymbol X}^T {\boldsymbol X}\) requires that it has full rank. An exact linear dependency among the predictors reduces the rank of the matrix, and a near-exact dependency makes it nearly singular, so the inversion is numerically unstable.] You can check for multicollinearity by looking at a correlation matrix of the predictors; the simplest way to deal with it is to remove one of the highly correlated predictors.

While LS regression can handle moderate correlation between predictors, it becomes problematic if the correlation is very high. If you run a regression in R and get a warning such as “prediction from a rank-deficient fit may be misleading” or a message about a linear dependency in the data, this is a sign there is multicollinearity in the data.
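A minimal sketch with simulated data (the near-duplicate predictor is an artificial, illustrative case): a correlation matrix of the predictors reveals the problem, and the fitted model shows the inflated standard errors that come with it:

```r
# A minimal sketch with simulated data: x2 is an almost exact copy of x1,
# so the two predictors are nearly collinear.
set.seed(9)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)       # near-exact linear dependency
y  <- 1 + 2 * x1 + rnorm(100)

cor(cbind(x1, x2))                     # correlation matrix of the predictors
summary(lm(y ~ x1 + x2))$coefficients  # note the inflated standard errors
```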

2.7 Normal Errors

Another assumption is that the error terms are normally distributed with mean zero and variance \(\sigma^2\):

\[\varepsilon | {\boldsymbol X} \sim \mathcal N (0, \sigma^2)\]

This assumption can generally be relaxed when \(n\) is large: by the CLT, the sampling distribution of the estimated coefficients is approximately normal even when the errors themselves are not, provided \(n\) is large enough.
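A minimal sketch with simulated data (the exponential errors are just one illustrative non-normal choice): even with skewed errors, the sampling distribution of the slope estimate looks approximately normal at a moderate sample size:

```r
# A minimal sketch with simulated data: repeatedly refit the model with
# skewed, mean-zero errors and look at the distribution of the slope estimates.
set.seed(13)
n <- 200
slopes <- replicate(2000, {
  x <- rnorm(n)
  e <- rexp(n) - 1  # skewed errors with mean zero
  y <- 1 + 2 * x + e
  coef(lm(y ~ x))[["x"]]
})
hist(slopes, breaks = 40, main = "Sampling distribution of the slope estimate")
```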

3 Asymptotic Properties of the LS Estimator

We can use the CLT to show that when \(n\) is large, the estimated regression coefficients become asymptotically normal, with \(\text{E}[b] = \beta\).

For a simple regression:

\[b_1 \sim \mathcal N \bigg( \beta_1 \;\; , \;\; \frac{\sigma^2}{\sum_i^n (x_i - \bar x)^2} \bigg)\]

For multiple regression:

\[{\boldsymbol b} \sim \mathcal N \bigg( {\boldsymbol \beta }\;\; , \;\; \sigma^2 ({\boldsymbol X}^T {\boldsymbol X})^{-1} \bigg)\]
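As a sketch (simulated data, illustrative names), the matrix \(\hat\sigma^2 ({\boldsymbol X}^T {\boldsymbol X})^{-1}\), with \(\sigma^2\) replaced by its usual estimate, matches the coefficient covariance matrix that vcov() reports for a fitted model:

```r
# A minimal sketch with simulated data: compute sigma_hat^2 * (X'X)^{-1}
# by hand and compare with vcov(fit).
set.seed(17)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)
s2  <- sum(resid(fit)^2) / (n - ncol(X))  # usual estimate of sigma^2

all.equal(s2 * solve(t(X) %*% X), vcov(fit), check.attributes = FALSE)
```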

3.1 Variance of the LS Estimator

In a simple regression, the variance of the LS slope estimator is:

\[\text{Var}[b_1] = \frac{\sigma^2}{\sum_i^n (x_i - \bar x)^2}\]

The greater the variation in \(x\), the smaller the variance of \(b\), i.e. more variation in the data gives a more precise estimate of \(b\):

The data shown on the right will give a more precise estimate of \(b\).
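A minimal sketch with simulated data (the two ranges of \(x\) are illustrative): two samples with the same error variance, where the sample with more spread in \(x\) yields a slope estimate with a noticeably smaller standard error:

```r
# A minimal sketch with simulated data: same error variance, different spread
# in x; compare the standard errors of the two slope estimates.
set.seed(11)
n <- 100
x_narrow <- runif(n, 0, 1)
x_wide   <- runif(n, 0, 10)
y_narrow <- 1 + 2 * x_narrow + rnorm(n)
y_wide   <- 1 + 2 * x_wide   + rnorm(n)

slope_se <- function(fit) summary(fit)$coefficients[2, "Std. Error"]
c(narrow = slope_se(lm(y_narrow ~ x_narrow)),
  wide   = slope_se(lm(y_wide ~ x_wide)))
```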

3.2 The Gauss-Markov theorem

Provided all the regression assumptions are satisfied, the method of least squares gives unbiased estimates of the regression coefficients, and those estimates have the smallest variance among all linear unbiased estimators. This result is the Gauss-Markov theorem:

In a linear model with homoscedastic errors, the LS estimator is BLUE (the Best Linear Unbiased Estimator).

If the errors are not homoscedastic, the LS estimator is still unbiased, provided the errors are uncorrelated with the regressors and have zero conditional mean, but it is no longer the minimum-variance linear unbiased estimator (see: Generalized Least Squares). If the zero-conditional-mean assumption itself fails, i.e. \(\text{E}[\varepsilon | {\boldsymbol X}] \neq 0\), the LS estimator is not even unbiased.