Central Limit Theorem

1 The Central Limit Theorem

In chapter 10 we demonstrated three important properties about the distribution of a sample mean:

  • its expected value equals the population mean: \(E[\bar X] = \mu\)
  • its variance equals \(\text{Var}[\bar X] = \frac{\sigma^2}{n}\), so its spread shrinks as \(n\) grows
  • its shape becomes approximately normal as \(n\) grows, regardless of the shape of the population distribution

These facts can be summarized in the following statement, called the central limit theorem:

If \(X_1, X_2, \dots, X_n\) are i.i.d. random variables and \(n\) is large enough, the distribution of the sample mean becomes approximately normal, with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\), where \(\mu\) and \(\sigma^2\) are the mean and variance of the population distribution:

\[\bar X \sim \mathcal N \bigg( \mu, \frac{\sigma^2}{n} \bigg)\]

1.1 A semantic conundrum

Don’t conflate the terms population distribution, sample distribution, and sampling distribution. They mean different things:

The population distribution is the true/theoretical distribution of the underlying population. Each RV should theoretically follow this distribution. It does not necessarily have to be normal (or even known) for us to make inferences about it.

A sample distribution is the distribution of observations in a single sample of data.

A sampling distribution is the distribution of a sample statistic (e.g. the sample mean \(\bar X\)) across several different samples. The key claim of the CLT is that the distribution of a sample mean can be approximated by a normal curve with \(E[\bar X] = \mu\) and \(\text{Var}[\bar X] = \sigma^2 / n\), even if the population distribution is not normal.

The plots below illustrate the difference between the three distributions when the RV is a dice roll:

Below are summary statistics for each distribution:

distribution                        mean     sd
population                         3.500  1.708
one random sample                  3.570  1.777
sampling distribution of the mean  3.497  0.171

You can see the s.d. of the sampling distribution is smaller than the population s.d. by a factor of roughly 10. This makes sense: here we used \(n=100\), and the CLT thus predicts the s.d. of the sample mean to be \(\sigma / \sqrt n = \sigma / \sqrt{100} = \sigma / 10\).
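Below is a minimal sketch of how summary statistics like these could be generated by simulation, assuming a fair six-sided die, samples of size \(n = 100\), and 1000 replications (the replication count is an assumption):

set.seed(1)
die = 1:6   # population: the faces of a fair six-sided die
n = 100     # sample size

# population mean and s.d. (dividing by the number of outcomes, not n - 1)
c(mean = mean(die), sd = sqrt(mean((die - mean(die))^2)))

# one random sample of n rolls
one_sample = sample(die, size = n, replace = TRUE)
c(mean = mean(one_sample), sd = sd(one_sample))

# sampling distribution of the mean: 1000 samples of size n, one mean per sample
sample_means = replicate(1000, mean(sample(die, size = n, replace = TRUE)))
c(mean = mean(sample_means), sd = sd(sample_means))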

1.2 Where the CLT comes from

The CLT arises from a convolution of the densities of the individual RVs: the density of a sum of independent RVs is the convolution of their individual densities. Whatever the shape of the population distribution, repeated convolution smooths it out, so when \(n\) is large enough the distribution of the sum (and hence of the sample mean) is approximately normal.
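As a rough illustration of this smoothing effect, the sketch below convolves the pmf of a single die roll with itself nine times to get the distribution of the sum of ten rolls (the number of rolls is an arbitrary choice):

p = rep(1/6, 6)   # pmf of a single fair die roll: flat, far from normal

# repeatedly convolve the pmf with itself: distribution of the sum of 10 rolls
p_sum = p
for (i in 2:10) p_sum = convolve(p_sum, rev(p), type = "open")

# the result is bell-shaped, despite the flat starting distribution
plot(10:60, p_sum, type = "h", xlab = 'sum of 10 dice', ylab = 'probability')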

1.3 Conditions of the CLT

The CLT should be valid for any random variable under the following conditions:

  • the sample observations are i.i.d.
  • the sample size is large: \(n > 30\) is a good rule of thumb, but strongly skewed population distributions may require even larger \(n\)

Note that if the original RVs are normally distributed, then the sampling distribution is exactly normal, no matter what \(n\) is. The CLT describes an approximate convergence to normality for any set of RVs, but the result is exact when the RVs are normal.

1.4 Chebyshev’s inequality

Under the CLT, the dispersion (s.d.) of the sampling distribution shrinks in proportion to \(1 / \sqrt n\).

An alternative convergence condition for probability distributions is provided by Chebyshev’s inequality:

\[P \bigg( \big| \bar X - \mu \big| \geq k \frac{\sigma}{\sqrt n} \bigg) \leq \frac{1}{k^2}\]

Unlike the CLT, Chebyshev’s inequality holds for any probability distribution with finite variance, and for any sample size. It’s useful in situations where the conditions of the CLT are not met (the CLT’s conditions, such as a large sample of i.i.d. observations, are stronger than Chebyshev’s requirement of a finite variance), though the bounds it gives are much weaker than the probabilities the CLT provides.
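For intuition, compare the two at \(k = 2\) standard errors (the choice of \(k\) is arbitrary):

k = 2

# Chebyshev's bound on P(|Xbar - mu| >= k standard errors), valid for any distribution
1 / k^2

# the same two-sided tail probability under the CLT's normal approximation
2 * pnorm(-k)

The normal approximation gives roughly 0.046, far below Chebyshev’s bound of 0.25; the looseness of the bound is the price of its generality.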

2 Using the CLT in Practice

Under the CLT:

\[\bar X \sim \mathcal N \bigg( \mu, \frac{\sigma^2}{n} \bigg)\]

Under this distribution, the \(Z\)-statistic of an observed sample mean \(\bar X\) is:

\[Z = \frac{\bar X - \mu}{\frac{\sigma}{\sqrt n}}\]

This can be used to calculate probabilities associated with any observed sample mean in a given experiment.
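For example, the sketch below computes such a probability using hypothetical values of \(\mu\), \(\sigma\), \(n\), and the observed sample mean (all of these numbers are assumptions for illustration):

mu = 50; sigma = 10; n = 36   # hypothetical population parameters and sample size
xbar = 52.5                   # hypothetical observed sample mean

z = (xbar - mu) / (sigma / sqrt(n))   # Z-statistic of the observed sample mean
z

pnorm(z, lower.tail = FALSE)   # probability of a sample mean at least this large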

In many situations you only have a sample, and don’t know the population mean and variance. This doesn’t prevent you from using the CLT—it turns out you can simply substitute your sample values of mean and variance instead, as plug-in estimates.

2.1 The plug-in principle

According to the plug-in principle, the features of a population distribution can be approximated by the same features of a sample distribution.

Using plug-in estimates of mean and variance, the CLT can be restated as:

\[\bar X \sim \mathcal N \bigg( \bar X, \frac{s^2}{n} \bigg) \hspace{0.5cm} \text{or} \hspace{0.5cm} \bar X \sim \mathcal N \big( \bar X, \text{SE}^2 \big)\]

where \(\bar X\) is the sample mean, \(s^2\) is the sample variance, and \(\text{SE}= \frac{s}{\sqrt n}\).

2.2 Probability calculations

With the plug-in principle, it’s easy to perform probability calculations with a sample distribution.

Using the pay gap data (from chapter 4), below is the sample distribution of the variable DiffMeanHourlyPercent (percentage difference in hourly wages between women and men):

library(ggplot2)

# sample distribution of the wage-gap variable, on the density scale
ggplot(aes(x = DiffMeanHourlyPercent), data = paygap) +
  geom_histogram(bins = 50, aes(y = ..density..)) +
  xlab('% difference in mean hourly wages')

For this sample distribution, \(\bar X = 12.354\), \(s = 12.556\), \(n = 151\), and \(\text{SE}= \frac{s}{\sqrt n} = 1.022\).

Using these as plug-in estimates for the CLT, you can construct the following normal approximation for the distribution of \(\bar X\):

\[\bar X \sim \mathcal N \big( \bar X, \text{SE}^2 \big) \hspace{0.3cm} \Longrightarrow \hspace{0.3cm}\bar X \sim \mathcal N \big( 12.354, 1.022^2 \big)\]

You can visualize this normal approximation as follows:
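Below is one way to draw the curve using ggplot’s stat_function(); the exact styling of the original figure is an assumption:

library(ggplot2)

Xbar = 12.354; SE = 1.022   # plug-in estimates from the sample above

ggplot(data.frame(x = c(Xbar - 4*SE, Xbar + 4*SE)), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = Xbar, sd = SE)) +
  xlab('sample mean % difference in hourly wages') +
  ylab('density')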

Now you can use pnorm() to make probability calculations.

The probability that the mean hourly wage difference is less than 10.5:

Xbar = mean(paygap$DiffMeanHourlyPercent)  # sample mean
s = sd(paygap$DiffMeanHourlyPercent)  # sample s.d.
n = nrow(paygap)  # sample size
SE = s/sqrt(n)  # standard error

pnorm(10.5, mean = Xbar, sd = SE)
## [1] 0.03477057

The probability that the mean hourly wage difference is between 11 and 13:

pnorm(13, mean = Xbar, sd = SE) - pnorm(11, mean = Xbar, sd = SE)
## [1] 0.6437967

3 When Sample Size is Small

What do you do when \(n < 30\)? (aside from getting more data)

3.1 The degrees-of-freedom adjustment

In previous chapters we mentioned that there are slightly different formulae for the sample s.d. and population s.d.:

\[s = \sqrt{\frac{1}{n-1} \sum_i^n (X_i - \bar X)^2} \hspace{1cm} \sigma = \sqrt{\frac 1n \sum_i^n (X_i - \bar X)^2}\]

The use of \(n-1\) instead of \(n\) in the formula for the sample s.d. is called a degrees-of-freedom adjustment. When the data is a sample, dividing by \(n\) systematically underestimates the true standard deviation, because the deviations are measured from the sample mean \(\bar X\) rather than the (unknown) population mean \(\mu\). Dividing by \(n-1\) instead corrects this bias.

Strictly speaking, you should always use the DoF-adjusted formula when working with sample data. In practice, when \(n\) is large the difference between \(1 / n\) and \(1 / (n - 1)\) is negligibly small, so either formula is acceptable. Note that the sd() function in R uses the DoF-adjusted formula by default.
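A quick sketch of the difference, using a small made-up sample (the values are assumptions):

x = c(2, 4, 4, 4, 5, 5, 7, 9)   # a small made-up sample
n = length(x)

sd(x)                              # DoF-adjusted: divides by n - 1
sqrt(sum((x - mean(x))^2) / n)     # divides by n: systematically smaller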

3.2 The t-distribution

When sample size is small, the normal distribution no longer provides a valid approximation for the sample mean under the CLT. But it turns out there is another distribution, similar to the normal, but designed specifically to handle small sample sizes: the \(t\)-distribution.

The \(t\)-distribution is symmetric and bell-shaped, like the normal. The crucial difference is that the \(t\)-distribution accounts for the extra uncertainty introduced by estimating the standard deviation from the sample (with the DoF-adjusted formula), which gives it heavier tails than the normal. The \(t\)-distribution is designed to describe the behavior of the standardized sample mean when the sample size is small (\(n<30\)).

Just as the normal distribution produces a \(Z\)-statistic, calculated as \(Z = \frac{\bar X - \mu}{\sigma / \sqrt n}\), the \(t\)-distribution produces a \(t\)-statistic:

\[t = \frac{\bar X - \mu}{\frac{s}{\sqrt n}}\]

where \(s\) is the DoF-adjusted sample standard deviation.

The \(t\)-distribution has only one parameter: \(\text{DoF}\) (degrees of freedom), where \(\text{DoF} = n-1\) (and \(n\) is the sample size).

E.g. if your sample has 12 observations (i.e. \(n=12\)), you should use the \(t\)-distribution with \(\text{DoF} = 11\). If \(n=20\), use the \(t\)-distribution with \(\text{DoF} = 19\), and so on.

The plots below show the pdf of the \(t\)-distribution for different DoFs, with the normal curve overlaid in black for comparison:
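Below is a rough sketch of how such a comparison could be drawn; the DoF values (2, 5, 30) are assumptions:

# standard normal pdf in black
curve(dnorm(x), from = -4, to = 4, lwd = 2, ylab = 'density')

# t pdfs for a few DoF values, dashed
for (df in c(2, 5, 30)) curve(dt(x, df = df), add = TRUE, lty = 2)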

Note that for \(\text{DoF}=30\), the \(t\)-distribution is almost exactly the same as the normal curve. The difference between the two curves is only significant for small \(n\). This is why, if your sample is large enough, you needn’t worry about using the \(t\)-distribution: the distinction only becomes important for small samples.

In R you can use pt() to compute probabilities associated with a \(t\)-distribution. E.g. the probability that \(t < -2\) under a \(t\)-distribution with 10 DoF:

pt(-2, df = 10)
## [1] 0.03669402

Compare this to the probability that \(Z < -2\) under a standard normal distribution:

pnorm(-2)
## [1] 0.02275013

i.e. the \(t\)-distribution has fatter tails than the equivalent normal, and produces more conservative estimates of probability. With 30 DoF, the probability that \(t < -2\) becomes:

pt(-2, df = 30)
## [1] 0.02731252

i.e. when \(n = 30\) the \(t\)-distribution is essentially the same as the normal.

3.3 When to use the t-distribution

  • the sample size is small (\(n < 30\))
  • the population mean \(\mu\) and variance \(\sigma^2\) are unknown
  • the RVs are i.i.d.
  • the RVs are normally distributed

The last condition is important: the \(t\)-distribution only provides a valid approximation for small samples if the population is normally distributed. This condition can be relaxed when \(n\) is large enough.
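To close, here is a sketch of a \(t\)-based probability calculation for a small sample; the data values and the reference mean of 10 are assumptions for illustration:

x = c(11.2, 9.8, 10.5, 12.1, 10.9, 9.4, 11.7, 10.2)   # a small made-up sample (n = 8)
n = length(x)

t_stat = (mean(x) - 10) / (sd(x) / sqrt(n))   # t-statistic against a reference mean of 10
t_stat

pt(t_stat, df = n - 1, lower.tail = FALSE)    # probability of a t-value at least this large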