Confidence Intervals

1 Point and Interval Estimators

A point estimator is a single plausible value for an unknown population parameter. Any single statistic computed from a sample (e.g. the sample mean, \(\bar X\)) can serve as a point estimator.

An interval estimator is a range of plausible values for an unknown parameter. Two common interval estimators are confidence intervals (a frequentist method) and credible intervals (a Bayesian method).

2 Confidence Intervals

A confidence interval is a range of values, computed from a sample of data, that has a specified approximate probability (the confidence level) of containing the true parameter. Confidence intervals are often more useful than point estimators as they provide a reasonable margin of error when estimating an unknown parameter.

Just as the sample mean is a point estimate of the true mean, a 95% confidence interval for the mean is an interval estimate of the true mean, and can be expressed as follows:

\[\text{P}(LB \leq \mu \leq UB) = 0.95\]

where \(LB\) and \(UB\) are the lower and upper bounds of the 95% confidence interval.

Note that since confidence intervals are computed from sample data, they are themselves estimators; thus for any given interval the 95% confidence level is only an approximate probability that the interval contains the true parameter.

3 Confidence Interval for a Mean

Theoretically you can compute a confidence interval for any parameter, e.g. the median, max, min, etc. Here we’ll show how to compute a confidence interval for a mean.

Computing the bounds of a confidence interval requires that you know the sampling distribution of the estimator in question. If you are computing a confidence interval for a mean, you can wield the central limit theorem to your advantage. Recall the CLT says that for sufficiently large \(n\), the sample mean approximately follows a normal distribution with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\):

\[\bar X \sim \mathcal N \bigg( \mu, \frac{\sigma^2}{n} \bigg)\]

Since the sample mean follows a normal distribution, you can visualize a 95% confidence interval for the mean as the central 95% of this sampling distribution: the shaded region within roughly two standard errors either side of \(\mu\), where \(\text{SE}= \frac{\sigma}{\sqrt n}\).
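As a rough sketch of this region, the following base R code plots a standard normal sampling distribution (i.e. the distribution of \(\bar X\) measured in standard-error units) and shades the central 95%; the plotting choices here are illustrative, not taken from the original figure:

# sketch: sampling distribution of the standardized sample mean,
# with the central 95% region shaded
z = seq(-4, 4, length.out = 400)
plot(z, dnorm(z), type = "l",
     xlab = "standard errors from the mean", ylab = "density")
shade = seq(qnorm(0.025), qnorm(0.975), length.out = 200)
polygon(c(shade, rev(shade)), c(dnorm(shade), rep(0, length(shade))),
        col = "grey80", border = NA)
lines(z, dnorm(z))   # redraw the density curve over the shaded region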

In estimation problems the significance level, denoted \(\alpha\), is the probability the true parameter lies outside the confidence interval. For a 95% confidence interval, \(\alpha =\) 0.05.

Since the interval is symmetric, its lower and upper bounds lie at the \(Z\)-statistics corresponding to the 2.5th and 97.5th percentiles of the distribution (i.e. \(Z_{\alpha/2}\) and \(Z_{(1- \alpha/2)}\)). These values are:

c(qnorm(0.025), qnorm(0.975))
## [1] -1.959964  1.959964

i.e. a 95% confidence interval will have a lower and upper bound 1.96 standard errors from the mean (provided the sample mean is normally distributed).

If the true mean and standard deviation are unknown (as is usually the case), and you are using the \(t\)-distribution to approximate the sample mean, the lower and upper bounds of the 95% confidence interval lie at the \(t\)-statistics corresponding to the 2.5th and 97.5th percentiles of the \(t\)-distribution (i.e. \(t_{\alpha/2}\) and \(t_{(1- \alpha/2)}\)) with the appropriate degrees of freedom. For DoF = 30, these values are:

c(qt(0.025, df = 30), qt(0.975, df = 30))
## [1] -2.042272  2.042272

i.e. if the mean follows a \(t\)-distribution with 30 DoF, a 95% confidence interval will have a lower and upper bound 2.04 standard errors from the mean.

In general, a confidence interval for the mean can be written:

\[\bar X - c \cdot SE \leq \mu \leq \bar X + c \cdot SE\]

where \(c\) is the critical value (a measure of how far the lower and upper bounds lie from the mean, in units of standard error). The value of \(c\) depends on the chosen confidence level and on the distribution used to model the sample mean:

If data follows a normal distribution, a confidence interval for the mean can be expressed:

\[\bar X - Z \cdot \frac{\sigma}{\sqrt n} \leq \mu \leq \bar X + Z \cdot \frac{\sigma}{\sqrt n}\]

where \(Z\) is the \(Z_{(1-\alpha/2)}\)-statistic of a normal distribution.

If data follows a t-distribution, a confidence interval for the mean can be expressed:

\[\bar X - t \cdot \frac{s}{\sqrt n} \leq \mu \leq \bar X + t \cdot \frac{s}{\sqrt n}\]

where \(t\) is the \(t_{(1-\alpha/2)}\)-statistic of a \(t\)-distribution with \(n-1\) degrees of freedom.

Generally for sample data you should use the \(t\)-distribution, since the true mean and s.d. of the population are unknown. However for large \(n\) it doesn’t make much difference, and you can use the normal distribution with plug-in estimates in place of \(\mu\) and \(\sigma\).
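To see how quickly the two agree, you can compare the \(t\) critical value for a 95% interval with the normal critical value across a range of sample sizes (the sample sizes below are arbitrary, chosen only for illustration):

# t vs normal critical values for a 95% CI at several sample sizes
sizes = c(5, 10, 30, 100, 1000)
data.frame(n = sizes, t_crit = qt(0.975, df = sizes - 1), z_crit = qnorm(0.975))

By around \(n = 100\) the \(t\) critical value is already within roughly 0.03 of 1.96, so the normal approximation is usually harmless for large samples.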

3.1 Computing the bounds in practice

In the pay gap data, the variable DiffMeanHourlyPercent has a sample mean \(\bar X =\) 12.356, a sample s.d. \(s =\) 16.009, and a sample size \(n =\) 153.

Suppose we want to compute a 95% confidence interval for the true mean difference in hourly wages. Since \(n > 30\), we can assume the sample mean is approximately normal with:

\[\bar X \sim \mathcal N \bigg( \bar X, \frac{s^2}{n} \bigg) \;\;\; \Longrightarrow \;\;\; \bar X \sim \mathcal N \bigg( 12.356, \frac{16.009^2}{153} \bigg)\]

where we have used the CLT with sample values as plug-in estimates. The general form of the confidence interval is:

\[\bar X - c \cdot \frac{s}{\sqrt n} \leq \mu \leq \bar X + c \cdot \frac{s}{\sqrt n}\]

Since we are using the normal distribution to approximate the data, \(c\) in this case is the \(Z_{0.975}\)-statistic of a standard normal distribution. This is:

qnorm(0.975)
## [1] 1.959964

Substituting the values above as plug-in estimates, we get the following confidence interval for the mean:

\[ \begin{aligned} 12.36 - 1.96 \cdot \frac{16.01}{\sqrt{153}} \leq \; &\mu \leq 12.36 + 1.96 \cdot \frac{16.01}{\sqrt{153}} \\ \\ \Longrightarrow \hspace{0.5cm} 9.82 \leq \; &\mu \leq 14.90 \end{aligned} \]
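As a quick check, the same bounds can be computed directly in R from the sample summaries quoted above (the summaries are rounded, so the result may differ slightly in the last decimal):

# normal-based 95% CI from the reported sample summaries
xbar = 12.356; s = 16.009; n = 153
xbar + c(-1, 1) * qnorm(0.975) * s / sqrt(n)   # roughly 9.82 and 14.89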

Note if we had used a \(t\)-distribution to approximate the data (which, strictly speaking, we should have, since \(\mu\) and \(\sigma\) are unknown), \(c\) would be the \(t_{0.975}\)-value of a \(t\)-distribution with 152 degrees of freedom. This is:

qt(0.975, df = 152)
## [1] 1.975694

Substituting these values gives the following confidence interval:

\[ \begin{aligned} 12.36 - 1.976 \cdot \frac{16.01}{\sqrt{153}} \leq \; &\mu \leq 12.36 + 1.976 \cdot \frac{16.01}{\sqrt{153}} \\ \\ \Longrightarrow \hspace{0.5cm} 9.80 \leq \; &\mu \leq 14.92 \end{aligned} \]

This interval is only trivially different to the one computed using the normal distribution, since \(n\) in this case is large.

But if, for instance, we had only 10 observations in our data and so used a \(t\)-distribution with 9 DoF, \(c\) would differ from 1.96 much more substantially:

qt(0.975, df = 9)
## [1] 2.262157

This gives a confidence interval (keeping the same \(\bar X\) and standard error, purely to illustrate the effect of \(c\)):

\[9.43 \leq \mu \leq 15.28\]

Thus when \(n\) is large, it’s sufficient to use the normal distribution to compute confidence intervals (with the appropriate plug-in estimates), since the difference between the \(t\) and normal distributions is trivial for large \(n\). But when dealing with small samples, you must use the \(t\)-distribution.

3.2 A function for calculating confidence intervals

R has no built-in function whose sole purpose is to compute a confidence interval from a vector of data (though t.test() does report one as part of its output). You can either perform the computations manually (as demonstrated above) or write a function yourself.

Below is an example of a user-defined function that can perform all the required computations for a confidence interval for a mean:

confidence_interval = function(data, conflevel) {
  n = length(data)           # sample size 
  xbar = mean(data)          # sample mean 
  SE = sd(data) / sqrt(n)    # standard error
  alpha = 1 - conflevel      # alpha
  
  lb = xbar + qt(alpha/2, df = n-1) * SE    # lower bound
  ub = xbar + qt(1-alpha/2, df = n-1) * SE  # upper bound
  
  cat(paste(c('sample mean =', round(xbar,3), '\n', 
              conflevel*100, '% confidence interval:', '\n', 
              'lower bound =', round(lb,3), '\n', 
              'upper bound =', round(ub,3))))
}

Running this code will define a new function in the environment, confidence_interval(), that takes two arguments: data, a numeric vector of data, and conflevel, the desired confidence level. It then computes and prints a confidence interval for the mean at the desired confidence level.

You can use this function to compute a 95% confidence interval for the mean difference in hourly wages:

confidence_interval(paygap$DiffMeanHourlyPercent, 0.95)
## sample mean = 12.356 
##  95 % confidence interval: 
##  lower bound = 9.799 
##  upper bound = 14.913

A 99% confidence interval for the same parameter:

confidence_interval(paygap$DiffMeanHourlyPercent, 0.99)
## sample mean = 12.356 
##  99 % confidence interval: 
##  lower bound = 8.98 
##  upper bound = 15.732
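For comparison, the same \(t\)-based intervals can be extracted from base R's t.test() (assuming the paygap data frame is loaded as above); the bounds should match those printed above up to rounding:

# confidence intervals as reported by t.test()
t.test(paygap$DiffMeanHourlyPercent, conf.level = 0.95)$conf.int
t.test(paygap$DiffMeanHourlyPercent, conf.level = 0.99)$conf.int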

4 Misconceptions About Confidence Intervals

It’s important to remember that confidence intervals are computed from sample data. This means that different samples of data will yield different confidence intervals, as is the case for point estimates. This is why the confidence level describes only the approximate probability the interval contains the true parameter. For any given interval, it either contains the true parameter or it doesn’t; there is no way to tell which.

Recall from chapter 7 that probability refers to the relative frequency of an event in a large number of trials (in the frequentist view, anyway). Thus the true interpretation of a 95% confidence interval is as follows:

A 95% confidence interval is a range of values where, if you repeated the experiment many times, approximately 95% of the confidence intervals generated will contain the true parameter.

This is the “proper” definition of a confidence interval. A common misconception is that any single 95% interval has a 95% probability of containing the true parameter: for a given interval, the parameter is either inside it or it isn’t; the 95% refers to the long-run proportion of such intervals that contain it.

Confidence intervals exhibit variability across different samples, in the same way that point estimates do. The following code demonstrates this by generating 20 random samples and plotting a 95% confidence interval for the mean based on each sample. The pink line is the true mean of the population.

library(ggplot2)

set.seed(12)

box = c(1,1,1,5)          # the "population": a box of tickets with true mean 2
n = 50                    # sample size
X = sample(box, n, replace = TRUE)
Xbar = mean(X)
SE = sd(X)/sqrt(n)
crit = qt(0.975, n-1)     # critical t-value for a 95% confidence interval

# draw a new sample and return the bounds of a 95% CI for its mean
experiment = function() {
  X = sample(box, n, replace = TRUE)
  Xbar = mean(X)
  SE = sd(X)/sqrt(n)
  c(Xbar - crit*SE, Xbar + crit*SE)
}

# repeat the experiment 20 times and collect the intervals
CIs = data.frame(t(replicate(20, experiment())))
names(CIs) = c("lower", "upper")
CIs$SampleNumber = 1:20

# plot the 20 intervals, with the true mean (2) marked as a line
ggplot() + 
  geom_errorbar(data = CIs, aes(x = as.factor(SampleNumber), ymin = lower, ymax = upper)) +
  geom_hline(yintercept = 2, color = 'violetred') + 
  coord_flip() + ylab('95% confidence interval') + xlab('sample number')

i.e. there is clear variability in the confidence intervals across the samples. Note that one of the 20 intervals does not contain the true mean (sample #10). This is a clear illustration of the 95% probability associated with the interval: across 20 different samples we expect roughly 95% of the intervals (about 19 of the 20) to contain the true mean.
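You can check this interpretation more thoroughly by repeating the experiment many more times and recording how often the interval covers the true mean of 2. The sketch below reuses box and n from the code above; the resulting proportion should come out close to 0.95, up to simulation error and the \(t\) approximation:

# proportion of 10,000 simulated 95% CIs that contain the true mean (2)
covered = replicate(10000, {
  X = sample(box, n, replace = TRUE)
  ci = mean(X) + c(-1, 1) * qt(0.975, df = n - 1) * sd(X) / sqrt(n)
  ci[1] <= 2 && 2 <= ci[2]
})
mean(covered)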

In general, without knowing the true parameter, there is no way to know whether a confidence interval generated from a particular sample actually contains the true parameter or not.

Nonetheless, confidence intervals are still useful in estimation, since they provide a range of plausible values for the true parameter rather than a single value, and they characterize the uncertainty in an estimate.