Suppose we have the following regression model predicting crime rate using 11 predictors [download the data here. For more information on the variables go here.]
reg1 = lm(crime_rate ~ age + southern_states + edu + ex0 + ex1 + labor + number_of_males + population + unemployment_14_24 + unemployment_35_39 + wealth, data = uscrime)
summary(reg1)
##
## Call:
## lm(formula = crime_rate ~ age + southern_states + edu + ex0 +
## ex1 + labor + number_of_males + population + unemployment_14_24 +
## unemployment_35_39 + wealth, data = uscrime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.581 -15.801 -1.954 11.168 58.203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -571.56959 170.46576 -3.353 0.00193 **
## age 1.02846 0.44964 2.287 0.02834 *
## southern_states 11.50565 13.33563 0.863 0.39413
## edu 1.28443 0.70173 1.830 0.07572 .
## ex0 1.80684 1.18549 1.524 0.13646
## ex1 -0.92128 1.26584 -0.728 0.47158
## labor 0.07785 0.15907 0.489 0.62761
## number_of_males 0.29503 0.22447 1.314 0.19728
## population 0.08485 0.13978 0.607 0.54774
## unemployment_14_24 -0.69195 0.48417 -1.429 0.16183
## unemployment_35_39 2.10467 0.95910 2.194 0.03493 *
## wealth -0.07989 0.09169 -0.871 0.38953
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.72 on 35 degrees of freedom
## Multiple R-squared: 0.6892, Adjusted R-squared: 0.5915
## F-statistic: 7.055 on 11 and 35 DF, p-value: 3.952e-06
The question: which variables should we keep/eliminate to improve the model?
The problem: as more predictors are added to a model, its bias decreases and its variability increases. Where possible, smaller models with fewer predictors are preferable to larger ones, as they are simpler and have lower variability.
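We want to avoid the two extremes: underfitting, where the model has too few predictors and high bias, and overfitting, where it has too many predictors and high variance.
A good model achieves the right balance between bias and variance.
Thus finding a good model involves trading off fit against complexity.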
We mentioned in chapter 16 that \(\bar R^2\) is a measure of goodness-of-fit that incorporates a penalty on the number of predictors in the model.
\[\bar R^2 = 1-\frac{\text{SSR}(p)}{\text{TSS}} - \frac{\text{SSR}(p)}{\text{TSS}} \cdot \frac{p}{N-p-1}\]
where \(p\) is the number of predictors in the model.
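As a quick check of this formula, the sketch below plugs in the values reported in the `summary(reg1)` output above: a multiple R-squared of 0.6892 and 35 residual degrees of freedom with 11 predictors, which implies \(N = 47\). The variable names are illustrative only.

```r
# Values read off the summary(reg1) output above
r2 <- 0.6892        # Multiple R-squared
p  <- 11            # number of predictors
n  <- 35 + p + 1    # residual df = N - p - 1 = 35, so N = 47

# SSR(p) / TSS equals 1 - R^2
ssr_over_tss <- 1 - r2

# Adjusted R-squared, term by term as in the formula above
adj_r2 <- 1 - ssr_over_tss - ssr_over_tss * p / (n - p - 1)
print(round(adj_r2, 4))
```

The result should agree with the adjusted R-squared of 0.5915 that R reports for the full model.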
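AIC (Akaike Information Criterion) is a measure of model quality. It combines a measure of lack of fit with a penalty for model complexity, so lower scores are better. An underfit model scores badly on the first part and an overfit model scores badly on the second, so AIC guards against both risks.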
Formally:
\[\text{AIC}(p) = \ln \bigg( \frac{\text{SSR}(p)}{N} \bigg) + (p+1) \frac 2N\]
AIC rewards goodness-of-fit (the first term, which falls as \(\text{SSR}(p)\) falls) but penalizes complexity (the second term, which is a penalty that increases with \(p\)).
Among a set of models, the preferred one is the one that minimizes the AIC score.
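A minimal sketch of this formula in R, using two residual sums of squares that appear in the stepwise output later in this section (21388 for the full model, 23535 for the six-predictor model). The function name `aic_score` is hypothetical, and the RSS values are the rounded figures as printed. `stepAIC` appears to report \(N\) times this quantity, which changes the scale but not the ranking of models.

```r
# AIC as defined above: ln(SSR / N) + (p + 1) * 2 / N
# (aic_score is a hypothetical helper, not a library function)
aic_score <- function(ssr, n, p) log(ssr / n) + (p + 1) * 2 / n

n <- 47  # observations: 35 residual df + 11 predictors + 1 intercept

# RSS values as printed (rounded) in the stepAIC trace
aic_full  <- aic_score(ssr = 21388, n = n, p = 11)  # all 11 predictors
aic_final <- aic_score(ssr = 23535, n = n, p = 6)   # six-predictor model

# Multiplying by N puts the scores on the scale that stepAIC prints
print(round(n * c(full = aic_full, final = aic_final), 2))

# TRUE if the smaller model has the lower (preferred) score
print(aic_final < aic_full)
```

The two rescaled scores can be compared with the `Start: AIC=311.66` and final `Step: AIC=306.16` lines of the trace.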
Another scoring method for assessing model quality is BIC (Bayesian Information Criterion).
Formally:
\[\text{BIC}(p) = \ln \bigg( \frac{\text{SSR}(p)}{N} \bigg) + (p+1) \frac{\ln N}{N}\]
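BIC also rewards goodness-of-fit and penalizes complexity. Its penalty per parameter is \(\ln N / N\) rather than \(2/N\), so BIC penalizes complexity more harshly than AIC whenever \(\ln N > 2\), that is, for \(N \geq 8\).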
Among a set of models, the preferred one is the one that minimizes the BIC score.
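To make the difference between the two penalties concrete, the sketch below compares AIC and BIC for the same pair of models from the crime data (\(N = 47\); RSS values taken, rounded, from the stepwise output later in this section). The helper names `aic_score` and `bic_score` are hypothetical.

```r
# Both criteria share the fit term ln(SSR / N); only the penalty differs
aic_score <- function(ssr, n, p) log(ssr / n) + (p + 1) * 2 / n
bic_score <- function(ssr, n, p) log(ssr / n) + (p + 1) * log(n) / n

n <- 47  # observations in the crime data

# Penalty added for each extra parameter under each criterion
print(c(aic = 2 / n, bic = log(n) / n))

# Score of the six-predictor model minus score of the full model;
# a negative difference means the smaller model is preferred
d_aic <- aic_score(23535, n, 6) - aic_score(21388, n, 11)
d_bic <- bic_score(23535, n, 6) - bic_score(21388, n, 11)
print(c(d_aic = d_aic, d_bic = d_bic))
```

Because the fit term is identical under both criteria, the gap between the two differences comes entirely from the heavier BIC penalty on the five extra predictors.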
In the next section we’ll show you a simple way to conduct a model search using AIC or BIC scoring criteria.
Suppose we have \(p\) predictors to choose among. Performing a model search involves searching through all \(2^p\) possible models and selecting the one with the best score. If \(p\) is relatively small, we can do a complete search over all the possible models.
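A complete search can be sketched as follows. To keep the example self-contained it uses R's built-in `mtcars` data with four arbitrarily chosen predictors rather than the crime data, so the choice of variables is an assumption made purely for illustration. Scores are taken from `extractAIC()`, which is on the same scale that `stepAIC` prints.

```r
# Illustration on R's built-in mtcars data, NOT the crime data
predictors <- c("wt", "hp", "qsec", "drat")
p <- length(predictors)

# Each row switches every predictor on or off, giving 2^p candidate models
grid <- expand.grid(rep(list(c(FALSE, TRUE)), p))

# Fit every candidate model and record its AIC
scores <- apply(grid, 1, function(keep) {
  rhs <- if (any(keep)) predictors[keep] else "1"  # "1" = intercept only
  fit <- lm(reformulate(rhs, response = "mpg"), data = mtcars)
  extractAIC(fit)[2]  # second element is the AIC value
})

# Pick the row with the lowest score
best <- unlist(grid[which.min(scores), ])
print(nrow(grid))        # number of candidate models searched
print(predictors[best])  # predictors in the lowest-AIC model
```

With 4 predictors this fits 16 models; with the 11 predictors of the crime model the same loop would fit \(2^{11} = 2048\).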
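When \(p\) is large, the number of candidate models grows too quickly for a complete search, so we use a stepwise search that examines only part of the model space. Two common methods are forward stepwise regression, which starts from a model with no predictors and adds them one at a time, and backward stepwise regression, which starts from the full model and removes them one at a time.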
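The following code applies backward stepwise regression to the crime data using AIC. At each step the table lists the AIC that would result from dropping each predictor, and the predictor whose removal gives the lowest AIC is dropped; the search stops when no removal improves on the current model (the `<none>` row).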
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# full model
reg2 = lm(crime_rate ~ ., data = uscrime)
# backward stepwise regression
reg3 = stepAIC(reg2, direction = "backward")
## Start: AIC=311.66
## crime_rate ~ age + southern_states + edu + ex0 + ex1 + labor +
## number_of_males + population + unemployment_14_24 + unemployment_35_39 +
## wealth
##
## Df Sum of Sq RSS AIC
## - labor 1 146.4 21535 309.98
## - population 1 225.2 21614 310.15
## - ex1 1 323.7 21712 310.37
## - southern_states 1 454.9 21843 310.65
## - wealth 1 463.9 21852 310.67
## <none> 21388 311.66
## - number_of_males 1 1055.7 22444 311.93
## - unemployment_14_24 1 1248.1 22636 312.33
## - ex0 1 1419.6 22808 312.68
## - edu 1 2047.3 23436 313.96
## - unemployment_35_39 1 2942.7 24331 315.72
## - age 1 3197.0 24585 316.21
##
## Step: AIC=309.98
## crime_rate ~ age + southern_states + edu + ex0 + ex1 + number_of_males +
## population + unemployment_14_24 + unemployment_35_39 + wealth
##
## Df Sum of Sq RSS AIC
## - southern_states 1 330.30 21865 308.70
## - population 1 337.86 21873 308.71
## - ex1 1 466.01 22001 308.99
## - wealth 1 514.61 22049 309.09
## <none> 21535 309.98
## - ex0 1 1668.44 23203 311.49
## - unemployment_14_24 1 2030.14 23565 312.22
## - number_of_males 1 2222.39 23757 312.60
## - edu 1 2482.00 24017 313.11
## - unemployment_35_39 1 3059.38 24594 314.23
## - age 1 3128.00 24663 314.36
##
## Step: AIC=308.7
## crime_rate ~ age + edu + ex0 + ex1 + number_of_males + population +
## unemployment_14_24 + unemployment_35_39 + wealth
##
## Df Sum of Sq RSS AIC
## - population 1 309.8 22175 307.36
## - ex1 1 380.9 22246 307.51
## - wealth 1 781.4 22646 308.35
## <none> 21865 308.70
## - ex0 1 1536.3 23401 309.89
## - edu 1 2206.9 24072 311.22
## - number_of_males 1 2239.8 24105 311.28
## - unemployment_14_24 1 2757.2 24622 312.28
## - age 1 3990.4 25855 314.58
## - unemployment_35_39 1 4000.5 25866 314.59
##
## Step: AIC=307.36
## crime_rate ~ age + edu + ex0 + ex1 + number_of_males + unemployment_14_24 +
## unemployment_35_39 + wealth
##
## Df Sum of Sq RSS AIC
## - ex1 1 479.4 22654 306.36
## - wealth 1 817.0 22992 307.06
## <none> 22175 307.36
## - ex0 1 1928.9 24104 309.28
## - number_of_males 1 1932.1 24107 309.29
## - edu 1 2133.5 24308 309.68
## - unemployment_14_24 1 2668.8 24844 310.70
## - age 1 3899.2 26074 312.97
## - unemployment_35_39 1 4103.6 26278 313.34
##
## Step: AIC=306.36
## crime_rate ~ age + edu + ex0 + number_of_males + unemployment_14_24 +
## unemployment_35_39 + wealth
##
## Df Sum of Sq RSS AIC
## - wealth 1 880.7 23535 306.16
## <none> 22654 306.36
## - edu 1 1893.8 24548 308.14
## - number_of_males 1 2435.9 25090 309.16
## - unemployment_14_24 1 2789.2 25443 309.82
## - age 1 3857.5 26512 311.75
## - unemployment_35_39 1 4288.5 26943 312.51
## - ex0 1 13058.5 35713 325.76
##
## Step: AIC=306.16
## crime_rate ~ age + edu + ex0 + number_of_males + unemployment_14_24 +
## unemployment_35_39
##
## Df Sum of Sq RSS AIC
## <none> 23535 306.16
## - edu 1 1090.6 24625 306.29
## - number_of_males 1 2504.3 26039 308.91
## - unemployment_14_24 1 2624.6 26159 309.13
## - unemployment_35_39 1 3950.1 27485 311.45
## - age 1 5610.0 29145 314.20
## - ex0 1 14426.7 37962 326.63
summary(reg3)
##
## Call:
## lm(formula = crime_rate ~ age + edu + ex0 + number_of_males +
## unemployment_14_24 + unemployment_35_39, data = uscrime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.462 -14.816 -3.607 14.031 51.755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -565.1158 145.3558 -3.888 0.000372 ***
## age 1.2445 0.4030 3.088 0.003656 **
## edu 0.7546 0.5543 1.361 0.180991
## ex0 0.8681 0.1753 4.952 1.38e-05 ***
## number_of_males 0.3392 0.1644 2.063 0.045638 *
## unemployment_14_24 -0.8620 0.4081 -2.112 0.040973 *
## unemployment_35_39 2.3103 0.8916 2.591 0.013289 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.26 on 40 degrees of freedom
## Multiple R-squared: 0.658, Adjusted R-squared: 0.6067
## F-statistic: 12.82 on 6 and 40 DF, p-value: 5.044e-08
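The same function can also run a forward search or score models by BIC. The sketch below is a hedged illustration on R's built-in `mtcars` data (not the crime data), with four arbitrarily chosen predictors. It relies on two documented `stepAIC` arguments: `scope`, which bounds the search, and `k`, the penalty per parameter (2 by default; `log(n)` gives BIC). Setting `trace = FALSE` suppresses the step-by-step printout.

```r
library(MASS)

# Illustration on R's built-in mtcars data, NOT the crime data
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)  # all candidates
null <- lm(mpg ~ 1, data = mtcars)                      # intercept only

# Forward stepwise search with AIC: start from the null model and
# add predictors one at a time, never going beyond the upper scope
fwd <- stepAIC(null, direction = "forward",
               scope = list(lower = ~ 1, upper = ~ wt + hp + qsec + drat),
               trace = FALSE)

# Backward stepwise search with BIC: same call as for AIC,
# but with the penalty per parameter set to log(n)
n <- nrow(mtcars)
bwd_bic <- stepAIC(full, direction = "backward", k = log(n), trace = FALSE)

print(formula(fwd))
print(formula(bwd_bic))
```

Forward and backward searches are not guaranteed to end at the same model, and neither is guaranteed to find the model a complete search would select.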