Samples, Sampling Distributions

1 Populations, Samples

Terminology recap:

A population is the complete set of observations required to conduct an experiment or answer a question fully. It’s often unrealistic (or impossible) to get data on a whole population.

A sample is a subset of data from a population. Samples are used to represent populations, and make inferences about population characteristics which are often unknown. The sample size (number of observations) is typically denoted \(n\).

An estimator is a statistic computed from a sample that is used to estimate an unknown population parameter. There are point estimators and interval estimators, e.g. the sample mean is a point estimator for the population mean, and a confidence interval for a mean is an interval estimator for the population mean.

A random sample is one where each observation is chosen randomly from the population.
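For example, a random sample can be drawn in R with sample(). The population below is invented purely for illustration (20 made-up measurements):

# a made-up population of 20 values, and a random sample of 5 of them
population = round(rnorm(20, mean = 170, sd = 10))
mysample = sample(population, size = 5)
mysample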

Notation conventions:

parameter               | population | sample
------------------------|------------|-----------
mean                    | \(\mu\)    | \(\bar x\)
standard deviation      | \(\sigma\) | \(s\)
correlation coefficient | \(\rho\)   | \(r\)
regression coefficient  | \(\beta\)  | \(b\)
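In R, the sample-side quantities are computed with built-in functions. The vectors below are made up purely for illustration:

# made-up observations of two variables
x = c(4.2, 5.1, 3.8, 4.9, 5.5)
y = c(2.0, 2.6, 1.9, 2.4, 2.8)

mean(x)           # sample mean, estimates mu
sd(x)             # sample standard deviation, estimates sigma
cor(x, y)         # sample correlation coefficient, estimates rho
coef(lm(y ~ x))   # sample regression coefficients, estimate beta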

2 Sample Variability

Since every sample of data is unique, estimates will vary randomly across different samples. This is called sampling variability. A truly random sample should be representative of the population, and should give accurate estimates of population parameters. But a representative sample is difficult to achieve in practice, especially when the sample size is small.

To demonstrate this, the following code draws a random sample of 10 observations from a sample space comprising integers from 1 to 6 (like rolling a die 10 times):

diceroll = sample(x = c(1,2,3,4,5,6), size = 10, replace = TRUE)
diceroll
##  [1] 2 3 1 3 1 1 5 6 3 3

If \(X\) is the RV for the outcome of a single roll, the above output is a random sample of 10 observations of \(X\). The sample mean, \(\bar X\), is:

mean(diceroll)
## [1] 2.8

If we repeat the process with another random sample, we will (likely) get a different set of observations, and a different sample mean:

diceroll2 = sample(x = c(1,2,3,4,5,6), size = 10, replace = TRUE)
diceroll2
##  [1] 2 2 3 6 5 5 4 2 3 2
mean(diceroll2)
## [1] 3.4

Compare these sample means to the theoretical mean of the population, in this case \(\mu = 3.5\). The estimation error differs substantially between the two samples (0.7 for the first, 0.1 for the second).

This is a trivial example, since the RV in question follows a known distribution (discrete uniform), and we know the population mean exactly. In most experiments the population distribution is unknown—all we have is a sample distribution, and whatever estimates it produces.

3 Sampling Error and Bias

There are two sources of error in sample-derived estimates:

The sampling error is a result of variability—the tendency of data to vary randomly across different samples. In the ensuing sections we introduce two convergence theorems that describe how the variability of an estimate depends on sample size.

Bias is the result of systematic (i.e. nonrandom) processes interfering with observations, e.g. poor calibration or selective sampling. In module 3 we will look at sources of bias.
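As a toy illustration of the difference between the two (this scenario is invented for demonstration): suppose a faulty recording procedure never registers a roll of 1. The resulting estimate is systematically too high, and collecting more data does not fix it:

# selective sampling: rolls of 1 are never recorded (hypothetical fault)
biased = sample(x = c(2,3,4,5,6), size = 10000, replace = TRUE)
mean(biased)   # close to 4, not the true mean of 3.5, no matter how large the sample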

4 Reliability of an Estimator

Reliability refers to the consistency of an estimator across different samples. The reliability of a result is closely related to its variability. An estimator with high variability, e.g. the sample mean in the “diceroll” simulation above, is considered unreliable, as its value tends to vary substantially across different samples.
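One rough way to quantify reliability (a supplementary sketch, not part of the module's examples) is to compare the spread of an estimator across many repeated samples, for different sample sizes:

# spread of the sample mean across 1000 repeated dice-roll samples
means_n10   = replicate(1000, mean(sample(1:6, size = 10,   replace = TRUE)))
means_n1000 = replicate(1000, mean(sample(1:6, size = 1000, replace = TRUE)))

sd(means_n10)     # large spread: the estimator is unreliable at n = 10
sd(means_n1000)   # much smaller spread: far more reliable at n = 1000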

5 The Law(s) of Large Numbers

The law of large numbers is a theorem that describes how the variability of an estimate decreases with sample size.

Here’s a simple demonstration of this: the following code runs the “diceroll” simulation three times, with sample sizes \(n = \{ 30, 100, 10000 \}\). The sample mean is computed for each value of \(n\).

for (i in c(30,100,10000)) {
  diceroll = sample(x = c(1,2,3,4,5,6), size = i, replace = TRUE)
  cat(paste('sample mean when n =', i, ':', round(mean(diceroll),3),'\n'))
}
## sample mean when n = 30 : 3.533 
## sample mean when n = 100 : 3.53 
## sample mean when n = 10000 : 3.5

Predictably, increasing sample size brings the sample mean closer to its theoretical value, \(\mu=3.5\). This is the law of large numbers—it describes how the sample mean converges to its true value as the sample size becomes large.

5.1 Strong LLN

The strong law of large numbers says the sample mean converges almost surely (i.e. with probability 1) to the true mean as \(n\) becomes large:

\[\bar X \rightarrow \mu \quad \text{as} \quad n \rightarrow \infty\]

5.2 Weak LLN

The weak law of large numbers says the sample mean converges in probability to the true mean. Specifically, it says the probability that the sampling error \(\big| \bar X - \mu \big|\) exceeds any fixed positive constant \(c\) goes to zero as \(n\) becomes large:

\[\text{P}\big( \big| \bar X - \mu \big| > c \big) \rightarrow 0 \quad \text{as} \quad n \rightarrow \infty\]

This is a cumbersome definition, but all it really says is the sample mean \(\bar X\) is likely to be near \(\mu\) when \(n\) is large. The weak LLN leaves open the possibility that \(\big| \bar X - \mu \big| > c \;\) at infrequent intervals, i.e. that occasionally the sample mean might deviate from the true mean, even with very large \(n\).

By contrast, the strong LLN describes almost-sure convergence to the true mean: with probability 1, \(\big| \bar X - \mu \big| < c \;\) holds for all sufficiently large \(n\). This is a stronger convergence condition than the weak LLN, and there are (admittedly unusual) distributions for which the weak law holds but the strong law doesn’t.
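One way to visualize the convergence described by the LLN (a supplementary sketch in base R) is to plot the running mean of a single long sequence of dice rolls; it settles toward 3.5 as the rolls accumulate:

# running mean of 10,000 dice rolls
rolls = sample(1:6, size = 10000, replace = TRUE)
running_mean = cumsum(rolls) / seq_along(rolls)

plot(running_mean, type = 'l', xlab = 'number of rolls', ylab = 'running mean')
abline(h = 3.5, lty = 2)   # theoretical mean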

6 Sampling Distributions—a demonstration

In the “diceroll” simulation above we showed how repeating a random process with a new sample will yield different sample statistics. Below we show what happens when we do this repeatedly. In the following code, the same experiment (a die being rolled 30 times) is repeated ten times, each time with a new random sample. The 10 sample means are printed:

N = 10 #number of trials
samplemeans = NULL #vector of sample means

for (i in 1:N) {
  diceroll = sample(x = c(1,2,3,4,5,6), size = 30, replace = TRUE)
  samplemeans[i] = mean(diceroll)
}

samplemeans
##  [1] 3.400000 3.166667 3.200000 3.766667 3.766667 3.333333 3.533333 3.366667
##  [9] 3.800000 3.133333

If \(X\) is the RV for the outcome of a dice roll, the above output is not 10 observations of \(X\), but a collection of sample means, \(\bar X\), across 10 random samples, each of which has 30 observations of \(X\). In other words, the above output is a distribution of sample means.

Each sample mean is calculated as \(\bar X = \frac 1n \sum_i^n X_i\), where \(X_i\) denotes a single observation in one of the samples.

As it turns out, you can model the sample mean \(\bar X\) as a random variable in its own right, and a collection of sample means (like the one above) gives an empirical picture of its distribution. Below is a histogram showing the 10 sample means from above:

library(ggplot2)    # for ggplot()
library(latex2exp)  # for TeX()

ggplot(data = as.data.frame(samplemeans), aes(x = samplemeans)) +
  geom_histogram(binwidth = 0.1) + 
  ggtitle(TeX('sampling distribution of $\\bar{X}$ (10 trials)')) 

This is called the sampling distribution of the mean. It shows the distribution of the sample mean, \(\bar X\), across several samples. Each sample in this case has size \(n=30\) (i.e. 30 independent observations of \(X\)).

As you repeat the experiment with more and more random samples, it turns out the shape of the sampling distribution converges. The plots below show the sampling distributions of \(\bar X\) using 100, 1000, and 10,000 random samples respectively, each with \(n=30\).

samplemeans = NULL

# 11,100 trials in total: 100 + 1000 + 10,000
for (i in 1:11100) {
    dicetoss = sample(x = c(1,2,3,4,5,6), size = 30, replace = TRUE)
    samplemeans[i] = mean(dicetoss)
}

xbar100 = samplemeans[1:100]
xbar1000 = samplemeans[101:1100]
xbar10000 = samplemeans[1101:11100]

plot1 = ggplot(data = as.data.frame(xbar100), aes(x = xbar100)) +
  geom_histogram(binwidth = 0.1, aes(y = ..density..)) +
  ggtitle(TeX('sampling dist. of $\\bar{X}$ (100 trials)')) +
  xlab(TeX('$\\bar{X}$'))  

plot2 = ggplot(data = as.data.frame(xbar1000), aes(x = xbar1000)) +
  geom_histogram(binwidth = 0.1, aes(y = ..density..)) +
  ggtitle(TeX('sampling dist. of $\\bar{X}$ (1000 trials)')) +
  xlab(TeX('$\\bar{X}$'))  

plot3 = ggplot(data = as.data.frame(xbar10000), aes(x = xbar10000)) +
  geom_histogram(binwidth = 0.1, aes(y = ..density..)) +
  ggtitle(TeX('sampling dist. of $\\bar{X}$ (10,000 trials)')) +
  xlab(TeX('$\\bar{X}$'))  

library(gridExtra)  # for grid.arrange()
grid.arrange(plot1, plot2, plot3, ncol = 3)

Below are the summary statistics for each of these distributions:

xbar_summary = data.frame(
              number.of.trials = c('100','1000','10,000'), 
              mean = c(round(mean(xbar100),3), 
                   round(mean(xbar1000),3), 
                   round(mean(xbar10000),3)), 
                          standard.deviation = c(round(sd(xbar100),3), 
                         round(sd(xbar1000),3), 
                         round(sd(xbar10000),3)))

library(knitr)  # for kable()
kable(xbar_summary)
number.of.trials | mean  | standard.deviation
-----------------|-------|-------------------
100              | 3.495 | 0.278
1000             | 3.485 | 0.303
10,000           | 3.500 | 0.311

There are three important things to take away from this demonstration:

1 — the sampling distribution of \(\bar X\) approached a normal curve with a large number of trials

The convergence of a sampling distribution to a normal curve is the premise of the central limit theorem. Note how this convergence occurred even though the original RV was not normally distributed (the outcomes of a dice roll follow a discrete uniform distribution).

This is an important result in statistics, as it implies the normal distribution can be used to model many random processes, even those that follow different distributions.

2 — the means of the sampling distributions tended to cluster around \(\bar X \approx 3.5\), the theoretical mean of the population distribution

The LLN predicts the convergence of the sample mean to the true mean, so this is an expected result.

3 — the standard deviations of the sampling distributions tended to cluster around \(\text{SD}[\bar X] \approx 0.3\)

Note how this is much smaller than the s.d. of the population distribution.

For reference, the s.d. of a discrete uniform distribution with values from 1 to \(a\) is given by \(\sigma = \sqrt{\frac{a^2-1}{12}}\). In this case, with values from 1 to 6, the population s.d. is \(\sigma = 1.708\). The s.d. of the sampling distribution is clearly much smaller.

It turns out the s.d. of a sampling distribution is scaled down from the population s.d. by a factor of \(\sqrt n\):

\[\text{SD}[\bar X] = \frac{\sigma}{\sqrt n} = \frac{1.708}{\sqrt{30}} = 0.31\]

This is another important result in statistics (next).
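As a quick check (a supplementary computation using base R rather than the plotting code above), simulating a large number of size-30 samples reproduces this value:

# standard deviation of 10,000 sample means (n = 30) vs the theoretical value
sim_means = replicate(10000, mean(sample(1:6, size = 30, replace = TRUE)))

sd(sim_means)                    # close to 0.31
sqrt((6^2 - 1)/12) / sqrt(30)    # theoretical value: sigma / sqrt(n) = 0.3118...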

7 Sampling Distributions—some theory

Let \(X\) be a random variable for the outcome of a random process, and let \(X_i\) be a single value/outcome/observation of that process. If you have a sample of \(n\) observations of this random process, you have a collection of RVs \(X_1, X_2, ..., X_n\).

These RVs \(X_1, X_2, ..., X_n\) are said to be independent and identically distributed (i.i.d.) if:

  • each \(X_i\) follows the same distribution as the others (identically distributed), and
  • the value of any one observation gives no information about the values of the others (independent).

We often assume that observations in a sample are effectively i.i.d.

The sample mean, \(\bar X\), is the mean of the observed outcomes in a random sample:

\[\bar X = \frac 1n \sum_i^n X_i\]

where \(X_i\) is a single observation in a random sample of size \(n\).

Different random samples will yield different values of \(\bar X\). The distribution of \(\bar X\) across different samples is called the sampling distribution of the mean.

7.1 The expected value of the sample mean

Recall the following properties of expectation:

  • the expected value of any one RV is simply the true mean of the population, i.e. \(\text{E}[X_i] = \mu\)
  • the expected value of each RV is the same (since they are i.i.d.)
  • the expected value of a sum of RVs is equal to the sum of the expected values of the RVs, i.e. \(\text{E}\big[ \sum_i X_i \big] = \sum_i \text{E}[X_i]\)
  • expectation scales linearly: \(\text{E}[aX] = a \text{E}[X]\)

It should follow intuitively that the expected value of a sample mean is simply the true mean: \(\text{E}[\bar X] = \mu\).

To see this algebraically, apply the expectation operator to the formula for the sample mean, and do some wrangling:

\[\text{E}[\bar X] = \text{E}\bigg[ \frac 1n \sum_i^n X_i \bigg] = \frac 1n \text{E}\bigg[ \sum_i^n X_i \bigg] = \frac 1n \sum_i^n \text{E}[X_i] = \frac 1n \sum_i^n \mu = \frac 1n \cdot n\mu = \mu \nonumber\]

\[\therefore \;\; \text{E}[\bar X] = \mu\]

You saw this result in the demonstration above: all three sampling distributions were centered around 3.5, the theoretical mean of the population distribution.

7.2 The variance of the sample mean

Recall the following properties of variance:

  • the variance of any one RV is simply the true variance of the population, i.e. \(\text{Var}[X_i] = \sigma^2\)
  • the variance of each RV is the same (since they are i.i.d.)
  • the variance of a sum of independent RVs is equal to the sum of the variances of the RVs, i.e. \(\text{Var}\big[ \sum_i X_i \big] = \sum_i \text{Var}[X_i]\)
  • variance scales by a square law: \(\text{Var}[aX] = a^2 \text{Var}[X]\)

The last property is what gives the variance of a sample mean its dependency on sample size: \(\text{Var}[\bar X] = \frac{\sigma^2}{n}\).

To see this algebraically:

\[\text{Var}[\bar X] = \text{Var}\bigg[ \frac 1n \sum_i^n X_i \bigg] = \frac{1}{n^2} \text{Var}\bigg[ \sum_i^n X_i \bigg] = \frac{1}{n^2} \sum_i^n \text{Var}[X_i] = \frac{1}{n^2} \sum_i^n \sigma^2 = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n} \nonumber\]

\[\therefore \;\; \text{Var}[\bar X] = \frac{\sigma^2}{n}\]

The takeaway: the variance of a sample mean decreases as n increases.

This is why, in the demonstration above, the s.d. of the sampling distribution was smaller than the population s.d.—it was scaled down by a factor of \(\sqrt n\). Note \(n\) always refers to the size of each sample, not the number of random samples used to create the sampling distribution.

You can understand this phenomenon intuitively through the LLN, which shows that increasing \(n\) makes the sample mean converge to the true mean. This is only possible if the sample means are becoming more tightly clustered around the true mean—which means the sampling distribution must be getting taller and narrower as \(n\) increases.

The code below plots three sampling distributions, with sample sizes \(n=\{ 10,50,100 \}\). The underlying RV is the same as above—a discrete uniform distribution with values from 1 to 6.

library(purrr)     # for rdunif()
library(magrittr)  # for the %>% pipe

# returns the mean of one random sample of n dice rolls
generator = function(n) {
  mean(rdunif(n, min = 1, max = 6))
}

sampledata = data.frame(Xbar = c(replicate(10000, generator(10)), 
                                 replicate(10000, generator(50)), 
                                 replicate(10000, generator(100))), 
                        n = c(rep(10,10000), rep(50,10000), rep(100,10000)) %>% as.factor())

ggplot(sampledata, aes(x = Xbar, fill = n)) + 
  geom_density(alpha = 0.3) + xlab(TeX('$\\bar{X}$')) 

Clearly, the dispersion of the sampling distribution decreases as \(n\) increases.

7.3 The standard error

The standard deviation of the sampling distribution of the mean is known as the standard error:

\[\text{SE}= \text{SD}[\bar X] = \sqrt{\text{Var}[\bar X]} = \frac{\sigma}{\sqrt n}\]
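In practice \(\sigma\) is usually unknown, so the standard error is estimated from a single sample by replacing \(\sigma\) with the sample standard deviation \(s\). A minimal helper (the function name is just for illustration):

# estimated standard error of the mean, computed from a single sample
std_error = function(x) sd(x) / sqrt(length(x))

diceroll = sample(1:6, size = 30, replace = TRUE)
std_error(diceroll)   # roughly 1.708 / sqrt(30) = 0.31, up to sampling variability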

8 Summary: Two Convergence Theorems

The two convergence theorems can be summarized as follows:

The law of large numbers describes how the sample mean converges to its theoretical value for large \(n\).

The central limit theorem describes how the sample mean converges in distribution to a normal distribution for large \(n\).

Below are two important caveats to these theorems.

8.1 When the LLN might fail

The LLN may not hold if the random variables are themselves so dispersed that they have no finite expected value. Such distributions are sometimes called “heavy-tailed”; the Cauchy distribution is the classic example.
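A quick way to see this (a supplementary sketch using base R's rcauchy()) is to plot the running mean of standard Cauchy draws. Unlike the dice-roll example, it never settles down, because the Cauchy distribution has no finite mean:

# running mean of 10,000 standard Cauchy draws -- it does not converge
draws = rcauchy(10000)
running_mean = cumsum(draws) / seq_along(draws)
plot(running_mean, type = 'l', xlab = 'number of draws', ylab = 'running mean')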

8.2 Square-root convergence

Note that when \(n\) is initially small, increasing the sample size changes the sampling distribution significantly, rapidly bringing the sample mean closer to the true mean and rapidly reducing the variance. But as \(n\) gets larger, progressively bigger increases in sample size are needed to improve the estimate by the same amount—the law of diminishing information.

You can see this in the expression for the standard error: \(\text{SD}[\bar X] = \sigma / \sqrt n\). The dispersion of the sampling distribution does not decrease linearly with \(n\), but with the square root of \(n\).

You can also see this in the plot in section 7.2: the dispersion of the distribution reduces substantially when \(n\) is increased from 10 to 50, but the change is more marginal when \(n\) is increased from 50 to 100.
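The diminishing returns are easy to quantify for the dice example, using \(\sigma \approx 1.708\):

# standard error of the mean for increasing sample sizes
sigma = sqrt((6^2 - 1)/12)                      # population s.d. of a fair die, about 1.708
round(sigma / sqrt(c(10, 50, 100, 1000)), 3)    # 0.540 0.242 0.171 0.054

Going from \(n=10\) to \(n=50\) more than halves the standard error, but going from \(n=50\) to \(n=100\) reduces it by only about 30%.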