Regression Inference Part I

1 Confidence intervals for regression coefficients

Sample-derived regression coefficients are point estimates of the true regression coefficients. In the same way that we construct confidence intervals for means, we can also construct confidence intervals for regression coefficients, as interval estimates of the true coefficients.

Consider the simple regression model of temperature on building height from the NYC heatwave data:

reg1 = lm(temperature ~ building_height, data = nycheat)
summary(reg1)
## 
## Call:
## lm(formula = temperature ~ building_height, data = nycheat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.0098  -2.9972   0.6057   3.1316  13.2119 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     94.69398    0.25672  368.86   <2e-16 ***
## building_height  0.48487    0.04609   10.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.77 on 1019 degrees of freedom
## Multiple R-squared:  0.09796,    Adjusted R-squared:  0.09708 
## F-statistic: 110.7 on 1 and 1019 DF,  p-value: < 2.2e-16

The estimated coefficient for building height is 0.48. The column Std. Error gives the standard error of this estimate. Recall that using the standard error we can construct a 95% confidence interval on a point estimate, with the lower and upper bounds approximately \(2 \; \text{SE}\) below and above the estimate.
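For example, you can compute this approximate 95% interval for the building height coefficient by hand from the coefficient table (a rough sketch; the confint() call below uses the exact \(t\) critical value, so its bounds differ very slightly):

## approximate 95% CI: estimate +/- 2 * SE
b = coef(summary(reg1))["building_height", "Estimate"]
se = coef(summary(reg1))["building_height", "Std. Error"]
b + c(-2, 2) * se   # compare with the confint() output below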

In R you can use the confint() function to compute confidence intervals for the coefficients of a regression model:

confint(reg1)
##                     2.5 %     97.5 %
## (Intercept)     94.190215 95.1977466
## building_height  0.394424  0.5753111

To get a 99% confidence interval:

confint(reg1, level = 0.99)
##                     0.5 %     99.5 %
## (Intercept)     94.031466 95.3564957
## building_height  0.365923  0.6038121

1.1 Generating a prediction interval

After we’ve specified a regression model (i.e. estimated values for the coefficients), we can use it to predict the value of the response variable for given values of the predictors. The predicted value, denoted \(\hat{y}\), is a point estimate. We can generate an interval estimate of the model’s prediction (a prediction interval) using the residual standard error.

Consider the simple regression model of temperature on albedo, as depicted below, with the regression line overlaid in blue:

The CLT allows us to model the errors of a regression model as normally distributed with mean 0 and standard deviation equal to the residual standard error. The residual standard error is assumed to be constant (homoscedastic errors). In the above plot, the distribution of the residuals is depicted in red at two points in the data. Below is the regression output for temperature on albedo:

reg2 = lm(temperature ~ albedo, data = nycheat)
summary(reg2)
## 
## Call:
## lm(formula = temperature ~ albedo, data = nycheat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6862  -2.2114   0.2867   2.5675  14.2978 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 101.45677    0.22419   452.6   <2e-16 ***
## albedo       -0.49191    0.02007   -24.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.984 on 1019 degrees of freedom
## Multiple R-squared:  0.3708, Adjusted R-squared:  0.3702 
## F-statistic: 600.5 on 1 and 1019 DF,  p-value: < 2.2e-16

The third line from the bottom shows the residual standard error of this model: 3.984. Using this we can generate a 95% prediction interval for the temperature for a given value of albedo. You can either do this manually (using the standard expression for a confidence interval) or you can use the predict() function in R.

E.g. below is a 95% prediction interval for the value of temperature when albedo = 10:

predict(reg2, newdata = data.frame(albedo=10), interval = 'prediction', level = 0.95)
##        fit     lwr      upr
## 1 96.53766 88.7158 104.3595

Here fit refers to the fitted value of the model, and lwr and upr are the lower and upper bounds of the prediction interval.
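As a rough manual check (a sketch only: it treats the fitted value as fixed and ignores its own standard error, so the interval comes out slightly narrower than the one from predict()):

## approximate 95% prediction interval: fitted value +/- t * residual standard error
yhat = predict(reg2, newdata = data.frame(albedo = 10))
rse = summary(reg2)$sigma                  # the residual standard error, 3.984
yhat + c(-1, 1) * qt(0.975, df = df.residual(reg2)) * rse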

2 Goodness of Fit

In regression models we often want to characterize how well the model “fits” the data. A useful statistic is the coefficient of determination, denoted \(R^2\), which is defined as the proportion of variability in the response variable that is explained by the model.

In the regression output, the second line from the bottom gives two measures of \(R^2\): the multiple \(R^2\) and the adjusted \(R^2\). These can be extracted manually from the regression summary as follows:

## r squared
summary(reg2)$r.squared
## [1] 0.3707936
## adjusted r squared
summary(reg2)$adj.r.squared
## [1] 0.3701761

i.e. in the simple regression of temperature on albedo, approximately 37% of the variability in temperature is explained by the predictor albedo.

To understand how the \(R^2\) value is computed, it’s useful to introduce a few terms that describe the variability in a regression model:

The total sum of squares, or TSS, represents the variability in the response variable:

\[\text{TSS} = \sum_i (y_i - \bar y)^2\]

The regression sum of squares, or RSS, represents the variability in the predicted values of the model:

\[\text{RSS} = \sum_i (\hat{y}_i - \bar y)^2\]

The sum of squared residuals, or SSR (aka sum of squared errors, SSE), represents the variability in the residuals of the model:

\[\text{SSR} = \sum_i e_i^2 \]

Using these, the coefficient of determination, \(R^2\), is defined as follows:

\[R^2 = \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\text{SSR}}{\text{TSS}}\]

i.e. the \(R^2\) value is the regression sum of squares divided by the total sum of squares—or the proportion of variability in the response variable that is explained by the model (the predictor variables). This is why \(R^2\) is often called a measure of a model’s goodness-of-fit.
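You can verify this definition directly in R (a quick sketch, assuming the nycheat data and the reg2 model from above are still in memory):

## compute TSS, RSS, and SSR by hand and recover R-squared
y = nycheat$temperature
yhat = fitted(reg2)
TSS = sum((y - mean(y))^2)
RSS = sum((yhat - mean(y))^2)
SSR = sum(residuals(reg2)^2)
RSS / TSS                  # same as 1 - SSR/TSS
summary(reg2)$r.squared    # should match (~0.371)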

Adding more relevant predictors to a regression model can increase the model’s overall goodness-of-fit. We saw above that in the simple regression of temperature on albedo, 37% of the variability in the response variable is explained by the model. Now let’s consider a multiple regression model for temperature, using the predictors vegetation, albedo, building height, and population density. Watch what happens to the \(R^2\):

reg3 = lm(temperature ~ vegetation + albedo + building_height + pop_density, data = nycheat)
summary(reg3)
## 
## Call:
## lm(formula = temperature ~ vegetation + albedo + building_height + 
##     pop_density, data = nycheat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6189  -2.1516   0.1683   2.0384  13.5973 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.021e+02  5.458e-01 187.050  < 2e-16 ***
## vegetation      -1.292e+01  8.159e-01 -15.831  < 2e-16 ***
## albedo          -3.667e-01  1.866e-02 -19.646  < 2e-16 ***
## building_height  1.685e-01  3.542e-02   4.758 2.24e-06 ***
## pop_density      2.283e-05  9.895e-06   2.307   0.0212 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.444 on 1016 degrees of freedom
## Multiple R-squared:  0.5311, Adjusted R-squared:  0.5292 
## F-statistic: 287.7 on 4 and 1016 DF,  p-value: < 2.2e-16

i.e. adding the three extra predictors increased the \(R^2\) to 0.53, suggesting these predictors help explain the variation in the response variable, and are thus useful to the model.

One pitfall of \(R^2\) is that by construction it never decreases when a predictor is added to the model, regardless of how useful or significant that predictor actually is. Of course, the more useful the predictor, the larger the resulting increase in \(R^2\), but \(R^2\) still creeps up marginally even when the added predictor is irrelevant. This can give an inflated sense of how “good” a model really is: you could in principle add thousands of irrelevant predictors to a model and end up with a very high \(R^2\) even though the model in reality is not nearly so impressive (this is known as overfitting).

A solution to this problem is provided by the adjusted \(R^2\), often denoted \(\bar R^2\), which incorporates a penalty on the number of predictors in the model. The adjusted \(R^2\) is defined as follows:

\[\bar R^2 = 1-(1-R^2) \frac{n-1}{n-k-1}\]

where \(n\) is the sample size and \(k\) is the number of predictor variables in the model. The adjusted \(R^2\) only increases if the contribution of the new variable more than offsets the penalty for the loss of a degree of freedom. In general you should use the adjusted \(R^2\) when evaluating multiple regression models.
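As a quick check of this formula against R’s output (a sketch using the four-predictor model reg3 fitted above):

## adjusted R-squared by hand
n = nrow(nycheat)
k = 4                                    # number of predictors in reg3
r2 = summary(reg3)$r.squared
1 - (1 - r2) * (n - 1) / (n - k - 1)     # compare with summary(reg3)$adj.r.squared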

3 Hypothesis tests on regression coefficients

We can use the hypothesis testing framework to determine whether each of the predictor variables in a given model is really significant. Consider the multiple regression model for temperature, using vegetation, albedo, building height, and population density as predictors:

reg3 = lm(temperature ~ vegetation + albedo + building_height + pop_density, data = nycheat)
summary(reg3)
## 
## Call:
## lm(formula = temperature ~ vegetation + albedo + building_height + 
##     pop_density, data = nycheat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6189  -2.1516   0.1683   2.0384  13.5973 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.021e+02  5.458e-01 187.050  < 2e-16 ***
## vegetation      -1.292e+01  8.159e-01 -15.831  < 2e-16 ***
## albedo          -3.667e-01  1.866e-02 -19.646  < 2e-16 ***
## building_height  1.685e-01  3.542e-02   4.758 2.24e-06 ***
## pop_density      2.283e-05  9.895e-06   2.307   0.0212 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.444 on 1016 degrees of freedom
## Multiple R-squared:  0.5311, Adjusted R-squared:  0.5292 
## F-statistic: 287.7 on 4 and 1016 DF,  p-value: < 2.2e-16

To test whether the observed coefficient on a predictor is significant, we first define the null hypothesis that the true coefficient is zero (i.e. the predictor has no effect on the response):

\[H_0: \beta_k = 0\] \[H_1: \beta_k \neq 0\]

The \(t\)-statistic for each observed coefficient is defined as follows:

\[t = \frac{b_k - \beta_k}{\sqrt{\frac{s^2}{\sum_{i=1}^n (X_i - \bar X)^2}}}\]

where \(b_k\) is the least squares coefficient on the \(k\)th predictor, \(\beta_k\) is the proposed null coefficient for the \(k\)th predictor (which is zero), and \(t\) follows a \(t\)-distribution with \(n-k-1\) degrees of freedom (the residual degrees of freedom, 1016 in the output above). The denominator is simply the standard error of \(b_k\); the expression shown applies to simple regression, while in multiple regression the standard error also accounts for correlations among the predictors and is reported directly in the column Std. Error. The observed \(t\)-statistic for each predictor is given in the column t value.

The \(p\)-value is the probability of observing a \(t\)-statistic at least as extreme as the one observed if the null were true (i.e. if the true coefficient were really zero). The observed \(p\)-value for each predictor is given in the column Pr(>|t|). In the output above, the first three predictors have very low \(p\)-values, so we can comfortably reject the null and treat their coefficients as significant. Note how the \(p\)-value for pop_density is larger than the others, at 0.02, and is thus significant only at the 5% level.
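These \(t\)-statistics and \(p\)-values can be reproduced from the coefficient table (a sketch; df.residual() returns the residual degrees of freedom, 1016 here):

## t statistics and two-sided p-values by hand
cf = coef(summary(reg3))
tval = cf[, "Estimate"] / cf[, "Std. Error"]
2 * pt(abs(tval), df = df.residual(reg3), lower.tail = FALSE)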

3.1 Testing joint hypotheses with the \(F\)-test

Performing independent tests on each coefficient, as demonstrated above, is useful for determining whether individual predictors are significant. But sometimes we want to test whether the model as a whole is significant—this involves jointly testing the hypotheses \(\beta_1 = \beta_2 = ... = \beta_k = 0\) against the alternative that at least one of the coefficients is \(\neq 0\). Testing a joint hypothesis with independent tests will not give the correct type I error (due to the multiple testing problem), and moreover, things get especially messy if the \(\beta\)’s are correlated (which they may well be, in a multiple regression model). To resolve this issue we need a test statistic that combines the hypotheses to perform a joint hypothesis test (as opposed to multiple individual tests).

The \(F\)-statistic serves exactly this purpose. When there are several groups (or parameters) to be tested jointly, the \(F\)-statistic compares the between-group variability (the explained variability) to the within-group variability (the unexplained variability). If the former is much larger than the latter, it implies the model is overall useful (the larger the \(F\)-statistic, the higher the proportion of explained variability in the model). For the overall test of a regression model, the \(F\)-statistic follows the \(F\)-distribution with the two degrees of freedom parameters \(d_1 = k\) and \(d_2 = n-k-1\), where \(n\) is the sample size and \(k\) is the number of predictors (in the output below, \(d_1 = 4\) and \(d_2 = 1016\)).

In the multiple regression model above, the \(F\)-statistic is 287.7:

summary(reg3)$fstatistic
##     value     numdf     dendf 
##  287.6749    4.0000 1016.0000

Not only is this \(F\)-statistic large, but the associated \(p\)-value is small (see the regression output above), suggesting the model is overall useful (we reject the null hypothesis and conclude that at least one of the predictors is significant).
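The associated \(p\)-value can be recovered from the \(F\)-distribution (a sketch):

## p-value for the overall F-test
fs = summary(reg3)$fstatistic
pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)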

3.2 Comparing nested models with ANOVA

One useful application of the \(F\)-test is in ANOVA (Analysis Of Variance) problems to determine which of a series of nested models is best. A nested model is a subset or restriction of a larger model, e.g. 

## full model
reg4 = lm(temperature ~ vegetation + albedo + building_height + pop_density, data = nycheat)

## restricted model
reg5 = lm(temperature ~ vegetation + albedo + building_height, data = nycheat)

In this case the model reg5 is a nested model since it loses one predictor, pop_density, when compared to the “full” model. We can use the \(F\)-test to determine whether the extra predictor(s) in the full model provide a significant improvement over the simpler model (if not, we prefer the simpler model, as it contains fewer irrelevant variables).

The \(F\)-statistic for comparing these two models is:

\[F = \frac{\bigg( \frac{SSR_{\text{sub}}-SSR_{\text{full}}}{q-k}\bigg)}{\bigg(\frac{SSR_{\text{full}}}{n-q-1} \bigg)}\]

where \(q\) is the number of predictors in the full model, \(k\) is the number of predictors in the restricted model, and \(n\) is the number of observations (so the denominator degrees of freedom, \(n-q-1\), is the residual degrees of freedom of the full model).
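Computing this by hand (a sketch; the numbers should line up with the anova() output below):

## F-statistic for the nested-model comparison by hand
ssr_full = sum(residuals(reg4)^2)
ssr_sub = sum(residuals(reg5)^2)
q = 4; k = 3; n = nrow(nycheat)
F_stat = ((ssr_sub - ssr_full) / (q - k)) / (ssr_full / (n - q - 1))
F_stat                                             # ~5.32
pf(F_stat, q - k, n - q - 1, lower.tail = FALSE)   # ~0.02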

In R we can use the anova() function to compare the two models:

anova(reg4, reg5)
## Analysis of Variance Table
## 
## Model 1: temperature ~ vegetation + albedo + building_height + pop_density
## Model 2: temperature ~ vegetation + albedo + building_height
##   Res.Df   RSS Df Sum of Sq      F  Pr(>F)  
## 1   1016 12054                              
## 2   1017 12117 -1   -63.152 5.3229 0.02125 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA output suggests that the extra predictor in the full model provides a significant improvement over the restricted model, with a \(p\)-value of 0.02.

4 Checking Residuals and Outliers

The autoplot() function from the ggfortify package produces four standard diagnostic plots that are useful for checking the residuals for unusual patterns and for identifying outliers in the data.
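For example, for the multiple regression model reg3 (a minimal call, assuming the ggfortify package is installed and loaded):

library(ggfortify)
autoplot(reg3)   # the four standard lm diagnostic plots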

Here’s what to look for in each of the plots:

Residuals vs Fitted Values (top left): the residuals should scatter randomly around zero with no clear pattern; a curved trend suggests the linear specification is missing some structure in the data.

Normal Q-Q (top right): the standardized residuals should fall roughly along the diagonal reference line if the errors are approximately normally distributed.

Scale Location (bottom left): the points should form a roughly horizontal band; a systematic trend in the spread suggests heteroscedastic (non-constant variance) errors.

Residuals vs Leverage (bottom right): look out for points with both high leverage and a large residual, as these can have an outsized influence on the estimated coefficients.

5 Multicollinearity

Here we run a simulation showing how multicollinearity in the data can affect the variability in the estimated regression coefficients.

In the code below, the function collinear_datagen() generates data for a regression model with the equation:

\[y = x_1 - 2x_2 + 3x_3 + \varepsilon\]

where \(\varepsilon\) is standard normal noise and the variable \(x_3\) is nearly a linear combination of \(x_1\) and \(x_2\) (weighted more heavily towards \(x_1\); see the code). The constant r controls the extent of the multicollinearity: the smaller it is, the less independent noise is added to \(x_3\) and the stronger the multicollinearity between the variables.

The function sim runs the simulation N times and saves the regression coefficients, their standard errors, and their \(p\)-values. Below are histograms showing the variability in the estimated coefficients \(b_1\), \(b_2\), and \(b_3\), under two values of \(r\): 0.1 (lower multicollinearity) and 0.01 (higher multicollinearity).

## generate collinear data
collinear_datagen = function(n,r) {
  
  x1 = rnorm(n)
  x2 = rnorm(n)
  x3 = 0.8*x1 + 0.2*x2 + sqrt(r)*rnorm(n)  # x3 is nearly a linear combination of x1 and x2
  y = x1 - 2*x2 + 3*x3 + rnorm(n)
  
  sample = data.frame(y,x1,x2,x3)
  
  model = summary(lm(y ~ x1 + x2 + x3, data = sample))
  
  ## keep the estimates, standard errors, and p-values (columns 1, 2, 4), dropping the intercept row
  modeldata = as.list(data.frame(model$coefficients[-1,c(1,2,4)]))
  modeldata
}

## run simulation
sim = function(N,n,r) {
  
  simulation = replicate(N, collinear_datagen(n,r))
  betas = do.call(rbind, simulation[1,])
  SEs =  do.call(rbind, simulation[2,])
  pvals =  do.call(rbind, simulation[3,])
  results = data.frame(cbind(betas, SEs, pvals))
  names(results) = c("b_1", "b_2", "b_3", "SE_1", "SE_2", "SE_3", "pval_1", "pval_2", "pval_3")
  results
  
}

## save simulation data
mc1 = sim(500,100,0.1) #coefficient simulation with lower multicollinearity
mc01 = sim(500,100,0.01) #coefficient simulation with higher multicollinearity 
mc1$r = "0.1"
mc01$r = "0.01"

Rsq1 = sim(500,100,0.1)
Rsq01 = sim(500,100,0.01)
Rsq1$r = "0.1"
Rsq01$r = "0.01"

mc = rbind(mc1,mc01)

## plots
plot1 = ggplot(mc, aes(b_1)) + geom_histogram(bins=30) + xlab(TeX("$b_1$")) + facet_grid(r~.) + geom_vline(xintercept = 1) 
plot2 = ggplot(mc, aes(b_2)) + geom_histogram(bins=30) + xlab(TeX("$b_2$")) + facet_grid(r~.) + geom_vline(xintercept = -2) 
plot3 = ggplot(mc, aes(b_3)) + geom_histogram(bins=30) + xlab(TeX("$b_3$")) + facet_grid(r~.) + geom_vline(xintercept = 3) 

grid.arrange(plot1, plot2, plot3, ncol=3)

The vertical lines on the plots show the population coefficients—which are \(\beta_1 = 1\), \(\beta_2 = -2\), and \(\beta_3 = 3\). It’s clear from these plots that the data with higher multicollinearity (with r = 0.01) has higher variability in its coefficient estimates.

Below is a correlation matrix summarizing the pairwise correlations between the estimated coefficients \(b_1\), \(b_2\), and \(b_3\) across the simulation runs. Notice how \(b_1\) and \(b_3\) are strongly negatively correlated, which reflects the near-collinearity between \(x_1\) and \(x_3\):

kable(cor(mc[ ,1:3]))
            b_1         b_2         b_3
b_1   1.0000000   0.8134041  -0.9862830
b_2   0.8134041   1.0000000  -0.8250027
b_3  -0.9862830  -0.8250027   1.0000000
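To see the collinearity in the predictors themselves, you can generate a single dataset and inspect the pairwise correlations of \(x_1\), \(x_2\), and \(x_3\) (a sketch; the seed is arbitrary and only there for reproducibility):

## pairwise correlations of the predictors in one simulated dataset with r = 0.01
set.seed(42)   # arbitrary seed
x1 = rnorm(100)
x2 = rnorm(100)
x3 = 0.8*x1 + 0.2*x2 + sqrt(0.01)*rnorm(100)
cor(data.frame(x1, x2, x3))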

To see how this affects statistical power, below is a table summarizing the power of the test on each coefficient under \(r=0.01\) and \(r=0.1\):

# see how this is reflected in the power
power = mc %>% group_by(r) %>% summarize(power_b1 = mean(pval_1 < 0.05), power_b2 = mean(pval_2 < 0.05), power_b3 = mean(pval_3 < 0.05))
kable(power)
r      power_b1   power_b2   power_b3
0.01      0.202          1       0.85
0.1       0.954          1       1.00