How good is an estimator overall? If \(\hat\theta\) is an estimator for \(\theta\), one way to quantify the discrepancy between \(\hat\theta\) and \(\theta\) is to use a loss function, \(L(\theta, \hat\theta)\). Two examples of loss functions:
\[L(\theta, \hat\theta) = | \hat\theta - \theta | \hspace{2cm} \text{absolute error loss}\] \[L(\theta, \hat\theta) = ( \hat\theta - \theta )^2 \hspace{2cm} \text{squared error loss}\]
To assess an estimator we can compute its average loss, which is called the risk of the estimator:
\[R(\theta,\hat\theta) = \text{E}\big[ L(\theta,\hat\theta) \big]\]
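For a quick worked example, take the sample mean \(\bar X\) of \(n\) i.i.d. observations with mean \(\mu\) and variance \(\sigma^2\). Since \(\text{E}[\bar X] = \mu\), its risk under squared error loss is just its variance:
\[R(\mu, \bar X) = \text{E}\big[ (\bar X - \mu)^2 \big] = \text{Var}[\bar X] = \frac{\sigma^2}{n}\]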
A common measure of an estimator’s overall quality is the mean squared error (MSE), defined as:
\[\text{MSE}(\hat\theta) = \text{E}\big[ (\hat\theta - \theta)^2 \big]\]
i.e. the MSE is the expected value of the estimator’s squared error loss—it’s a measure of the average squared distance between the estimator and the true value.
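To make this concrete, here is a minimal Python sketch (not part of the original notes) that approximates the MSE of the sample mean by simulation; for normal data it should land close to the \(\sigma^2/n\) value from the worked example above.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, n_sims = 5.0, 2.0, 25, 100_000

# Draw many samples and record the squared error of the sample mean in each one
samples = rng.normal(mu, sigma, size=(n_sims, n))
theta_hat = samples.mean(axis=1)            # the estimator: the sample mean
mse_mc = np.mean((theta_hat - mu) ** 2)     # Monte Carlo estimate of the MSE

print(mse_mc)          # approximately 0.16
print(sigma**2 / n)    # theoretical value: 4 / 25 = 0.16
```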
We know that bias and variance both contribute to the overall error of an estimator. By expanding the form given above, it’s possible to show that the MSE is a combination of both:
\[ \begin{aligned} \text{MSE}(\hat\theta) &= \text{E}\big[ (\hat\theta - \theta)^2 \big] \\ &= \text{E}\big[ (\hat\theta - \text{E}[\hat\theta] + \text{E}[\hat\theta] - \theta)^2 \big] \\ &= \text{E}\big[ (\hat\theta - \text{E}[\hat\theta])^2 \big] + 2\big( \text{E}[\hat\theta] - \theta \big)\, \text{E}\big[ \hat\theta - \text{E}[\hat\theta] \big] + \big( \text{E}[\hat\theta] - \theta \big)^2 \\ &= \text{E}\big[ (\hat\theta - \text{E}[\hat\theta])^2 \big] + \big( \text{E}[\hat\theta]-\theta \big)^2 \\ &= \text{Var}[\hat\theta] + \text{Bias}[\hat\theta]^2 \end{aligned} \]
(the cross term vanishes since \(\text{E}\big[ \hat\theta - \text{E}[\hat\theta] \big] = 0\))
i.e. the MSE can be expressed as the sum of the variance and squared bias of the estimator. If an estimator is unbiased, its MSE is simply equal to its variance.
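As a sanity check of the decomposition, here is a rough Python sketch (not from the original notes) using a biased estimator as the example: the divide-by-\(n\) sample variance. The simulated MSE should match the simulated variance plus squared bias, up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n, n_sims = 1.0, 10, 200_000

# A biased estimator: the sample variance with divisor n (ddof=0) underestimates sigma^2
samples = rng.normal(0.0, sigma, size=(n_sims, n))
sigma2_hat = samples.var(axis=1, ddof=0)

bias = sigma2_hat.mean() - sigma**2                 # roughly -sigma^2 / n = -0.1
variance = sigma2_hat.var()
mse = np.mean((sigma2_hat - sigma**2) ** 2)

print(mse)                   # approximately 0.19
print(variance + bias**2)    # agrees with the MSE, as the decomposition says
```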
Note the difference between bias and variance:
\[\text{Var}[\hat\theta] = \text{E}\big[ (\hat\theta - \text{E}[\hat\theta])^2 \big] \hspace{1cm} \text{Bias}[\hat\theta] = \text{E}[\hat\theta] - \theta\]
i.e. variance measures the spread of the estimates across repeated samples (how tightly they cluster), and bias measures how far the center of that cluster is from the true value. Below is an illustration of different combinations of bias and variance:
The ideal estimator would have low bias and low variance. It turns out, though, that it’s not always possible to minimize both: in many cases an estimator with low variance has high bias, and vice versa. You’ll come to see that it is sometimes better to accept a little bias, since this can result in a better overall estimator (as the sketch below illustrates).
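One classic illustration of this trade-off (a minimal Python sketch, assuming normally distributed data) compares three estimators of \(\sigma^2\) built from the same sum of squared deviations: dividing by \(n-1\) is unbiased, dividing by \(n\) gives the MLE, and dividing by \(n+1\) is the most biased of the three yet has the smallest MSE.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n, n_sims = 1.0, 10, 200_000

samples = rng.normal(0.0, sigma, size=(n_sims, n))
ss = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

# Same sum of squares, three different divisors
for divisor in (n - 1, n, n + 1):        # unbiased, MLE, smallest MSE
    print(divisor, np.mean((ss / divisor - sigma**2) ** 2))
# The n+1 divisor gives the smallest MSE (about 0.18 vs 0.19 vs 0.22 here),
# despite having the largest bias of the three.
```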
Choosing the ideal estimator is part of statistical decision theory.
The Gauss-Markov theorem says that in a linear regression model whose errors have mean zero, constant variance, and are uncorrelated, the least squares estimator is BLUE (the Best Linear Unbiased Estimator). It’s dubbed the “best” estimator since it’s unbiased and also has the smallest possible variance among all linear unbiased estimators.
The multivariate LS coefficient vector is given by (from chapter 15):
\[{\boldsymbol b} = ({\boldsymbol X}^T {\boldsymbol X})^{-1} {\boldsymbol X}^T {\boldsymbol y}\]
Substituting \({\boldsymbol y} = {\boldsymbol X} {\boldsymbol \beta }+ {\boldsymbol \varepsilon}\):
\[ \begin{aligned} {\boldsymbol b} &= ({\boldsymbol X}^T {\boldsymbol X})^{-1} {\boldsymbol X}^T ({\boldsymbol X} {\boldsymbol \beta }+ {\boldsymbol \varepsilon}) \\ &= {\boldsymbol \beta }+ ({\boldsymbol X}^T {\boldsymbol X})^{-1} {\boldsymbol X}^T {\boldsymbol \varepsilon} \end{aligned} \]
Thus:
\[\text{E}[{\boldsymbol b} | {\boldsymbol X}] = {\boldsymbol \beta }+ \text{E}\big[ ({\boldsymbol X}^T {\boldsymbol X})^{-1} {\boldsymbol X}^T {\boldsymbol \varepsilon }| {\boldsymbol X} \big]\]
But one of the assumptions of least squares regression is that the errors have zero mean conditional on the predictors, i.e. \(\text{E}[{\boldsymbol \varepsilon} | {\boldsymbol X}] = {\boldsymbol 0}\), which makes the second term go to zero, leaving
\[\text{E}[{\boldsymbol b} | {\boldsymbol X}] = {\boldsymbol \beta}\]
i.e. the LS estimator \({\boldsymbol b}\) is an unbiased estimate of \({\boldsymbol \beta}\).
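Here is a small Python sketch of that claim (the numbers are hypothetical): holding \({\boldsymbol X}\) fixed and redrawing the noise many times, the average of the LS coefficient vectors lands on \({\boldsymbol \beta}\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, n_sims = 100, 3, 5_000
beta = np.array([1.0, -2.0, 0.5])               # "true" coefficients for the simulation

# Fixed design matrix: an intercept column plus two random predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

b_draws = np.empty((n_sims, p))
for i in range(n_sims):
    eps = rng.normal(0.0, 1.0, size=n)                  # noise with E[eps | X] = 0
    y = X @ beta + eps
    b_draws[i] = np.linalg.solve(X.T @ X, X.T @ y)      # b = (X^T X)^{-1} X^T y

print(b_draws.mean(axis=0))   # close to beta, as unbiasedness predicts
```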
Caveat: the LS estimator gives unbiased estimates of the true regression coefficients in theory, but this guarantee is about averages over repeated samples; with a single sample from a population you also have to look at the sampling strategy and the experimental design. If the sample itself is biased, the estimates will be biased as well.
See this link. It shows that the LS estimator has the lowest possible variance among all linear unbiased estimators.
However, once the unbiasedness requirement is dropped, the LS estimator is not the best possible estimator in general: it turns out there is a biased estimator with an even smaller MSE, known as the James-Stein estimator (next).
Suppose the parameter of interest is the mean vector of a multivariate normal, \(\theta = \mu = (\mu_1, \ldots, \mu_p)\).
Suppose \(X\) is a multivariate normal observation with mean \(\mu\). Then \(\text{E}[X] = \mu\), so \(X\) itself is an unbiased estimator of \(\mu\). What about its MSE?
If \(p = 1\) or \(p = 2\), \(X\) cannot be improved upon: no other estimator has a uniformly lower MSE.
If \(p \geq 3\), \(X\) can be improved upon: the James-Stein estimator has a lower MSE than \(X\) for every value of \(\mu\).
This is sometimes called Stein’s paradox after Charles Stein.
In high dimensions, it can often be better to be biased. Read more on this here.
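To see Stein’s paradox numerically, here is a minimal Python sketch (assuming identity covariance and shrinkage toward the origin) comparing the usual estimator \(X\) with the James-Stein estimator \(\big(1 - \tfrac{p-2}{\|X\|^2}\big)X\) for \(p = 10\); the shrunken, biased estimator achieves a noticeably lower average squared error.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n_sims = 10, 100_000
mu = np.full(p, 1.0)                          # true mean vector (arbitrary choice)

# One N_p(mu, I) observation per simulation
X = rng.normal(loc=mu, scale=1.0, size=(n_sims, p))

# James-Stein: shrink each observation toward the origin
shrink = 1.0 - (p - 2) / np.sum(X**2, axis=1, keepdims=True)
js = shrink * X

print(np.mean(np.sum((X - mu) ** 2, axis=1)))    # risk of X: about p = 10
print(np.mean(np.sum((js - mu) ** 2, axis=1)))   # James-Stein risk: clearly smaller
```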