Validity and Bias

1 Two Sources of Error in an Estimator

To recap: in chapter 10 we mentioned that there are two sources of error in estimation problems:

  • variability—random error, due to chance fluctuations in the sample
  • bias—systematic error, due to flaws in the sampling/measurement technique

Both can contribute to the overall error in an estimate. In module 2 we focused on variability, and we introduced two convergence theorems that describe how variability decreases with sample size.

Bias is a more insidious source of error—its presence indicates there is something wrong with the sampling/measurement technique, or with the design of the experiment. Bias is systematic (not due to random fluctuations) and cannot be reduced by increasing the sample size.

Formally, the bias of an estimator \(\hat\theta\) is the difference between its expected value and the underlying population value \(\theta\) being estimated:

\[\text{Bias}[\hat\theta] = \text{E}[\hat\theta] - \theta\]

i.e. there is bias when the expected value of the estimator differs from the true value.

1.1 Reliability vs. Validity

Two important concepts associated with variability and bias:

  • reliability—the consistency of a result, i.e. the extent to which it produces similar values in different samples
  • validity—the accuracy of a result, i.e. the extent to which it measures what it is actually intended to measure

Reliability is related to variability—a result is reliable if it has low variability, and vice versa. Validity is related to bias—a result is valid if it has high accuracy, and vice versa.

Note that reliability does not imply validity—e.g. a scale can produce reliable (consistent) estimates of mass, but if incorrectly calibrated, its estimates will be systematically incorrect.
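The scale example can be sketched in a short simulation (the numbers here are hypothetical: a true mass of 100 g and a scale that reads 5 g low with very little noise):

```r
# Hypothetical miscalibrated scale: reliable (low variability) but not valid (biased).
set.seed(1)
true_mass <- 100
readings  <- rnorm(1000, mean = true_mass - 5, sd = 0.1)  # reads 5 g low, tiny noise

mean(readings)  # near 95, not 100 -- consistent but systematically wrong
sd(readings)    # small -- the readings are highly reliable
```

The readings agree closely with each other (reliability) while all missing the true value (no validity).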

There are two major kinds of validity in experiments: internal validity and external validity.

1.2 External validity

External validity concerns the generalizability of a result, i.e. whether it will hold in different samples/settings. For a result to have external validity, it should be portable, and it should not depend on in-sample conditions.

A common threat to external validity is a non-representative sample—this is known as sampling bias. Results from a biased sample will only be valid for that sample, and cannot be generalized to the entire population.

1.3 Internal validity

Internal validity concerns the correct identification of causal relationships in an experiment. For a result to have internal validity, confounding variables must be controlled for, and the causal effect must be properly measured/identified. Some common threats to internal validity:

  • confounding variables/events that are not controlled for—omitted variable bias
  • incorrectly identified variables and causal relationships—specification bias
  • poorly made/calibrated measurements—measurement error bias

In the remainder of this module we will examine sources of bias.

2 Estimator Bias

An estimator is biased if its expected value differs from the true value of the parameter being estimated.

If \(\hat\theta\) is an estimator for the true parameter \(\theta\), the bias of the estimator can be written:

\[\text{Bias}[\hat\theta] = \text{E}[\hat\theta] - \theta\]

2.1 Unbiased estimators—the sample mean

The expected value of the sample mean equals the true mean for any sample size, \(\text{E}[\bar X] = \mu\) (and by the LLN, \(\bar X\) itself converges to \(\mu\) when \(n\) is large). The bias of the sample mean is thus:

\[\text{Bias}[\bar X] = \text{E}[\bar X] - \mu = \mu - \mu = 0\]

In other words, the sample mean is an unbiased estimator for the true mean. If you want to estimate the true mean of a population, the sample mean should give you an accurate estimate (provided there are no other sources of bias in the sample).
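This can be checked by simulation. The sketch below draws many samples from \(\mathcal U(0,10)\), whose true mean is \(\mu = 5\), and computes the average of the resulting sample means:

```r
# Sampling distribution of the sample mean for X ~ U(0, 10), where mu = 5.
set.seed(1)
xbar <- replicate(n = 1000, mean(runif(n = 100, min = 0, max = 10)))

mean(xbar)  # approximately 5 -- no systematic over/underestimation
```

The average of the simulated sample means sits very close to 5, consistent with zero bias.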

Not all estimators are inherently unbiased, as the next section will demonstrate.

2.2 Biased estimators—the sample maximum

Suppose now we want to estimate the maximum value of a distribution. Is the sample maximum an unbiased estimator for the true maximum? As it turns out, the answer is no.

To demonstrate this, let \(X\) be a continuously distributed RV between 0 and 10, i.e. \(X \sim \mathcal U(0,10)\). The following code generates 100 random observations of \(X\) and plots the histogram:

X = runif(n = 100, min = 0, max = 10)

ggplot(aes(x = X), data = as.data.frame(X)) + 
  geom_histogram(binwidth = 0.5) 

If \(\theta\) denotes the true maximum of the population distribution, we know that in this case \(\theta = 10\). The sample maximum, \(\hat\theta\), is:

max(X)
## [1] 9.962283

If we repeat the process many times, we can construct a sampling distribution of the maximum:

thetahat = replicate(n = 1000, max(runif(n = 100, min = 0, max = 10)))

ggplot(aes(x = thetahat), data = as.data.frame(thetahat)) + 
  geom_histogram(bins = 50) + ggtitle('sampling distribution of the maximum') + xlab(TeX('$\\hat{\\theta}$')) 

Note how the sampling distribution of \(\hat\theta\) is neither bell-shaped nor centered at the true maximum \(\theta\). In fact, based on this sampling distribution, the expected value of the sample maximum is:

mean(thetahat)
## [1] 9.911053

Clearly, the sample maximum underestimates the true maximum, since \(\text{E}[\hat\theta] - \theta = 9.911 - 10 < 0\). The bias of the sample maximum in this example is:

mean(thetahat) - 10
## [1] -0.08894701

With a bit of calculus you can show that, in fact, the expected value of the sample maximum is

\[\text{E}[\hat\theta] = \frac{n}{n+1} \theta\]

i.e. the sample maximum is a biased estimator for the true maximum, since it will consistently underestimate the true maximum by a factor \(\frac{n}{n+1}\). See the proof here.

Using this fact we can construct a bias-corrected estimator for the true maximum:

\[\frac{n+1}{n} \hat\theta\] where \(\hat\theta\) is the sample maximum.

For the above example, where \(n=100\), the bias-corrected estimate of the true maximum is:

mean(thetahat)*((100+1)/100)
## [1] 10.01016

which is clearly much more accurate than the uncorrected sample value.

2.3 Bessel’s Correction

Another example of estimator bias arises when estimating the variance of a distribution. Formally, the population variance is defined \(\text{Var}[X] = \text{E}[(X-\mu)^2]\), or written as a sum:

\[\sigma^2 = \frac 1n \sum_i^n (X_i - \mu)^2\]

It turns out this formula is only valid when the true mean \(\mu\) is known. When \(\mu\) is replaced by the sample mean \(\bar X\), the formula will consistently underestimate the population variance by a factor \(\frac{n-1}{n}\):

\[\text{E}\bigg[ \frac 1n \sum_i^n (X_i - \bar X)^2 \bigg] = \frac{n-1}{n} \sigma^2\]

See the proof here. This bias can be remedied by using \(n-1\) in the denominator instead of \(n\). The bias-corrected formula, known as the sample variance, is:

\[s^2 = \frac{1}{n-1} \sum_i^n (X_i - \bar X)^2\]

where the use of \(n-1\) is known as Bessel’s correction.
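The effect of the correction can be seen by simulation. The sketch below repeatedly draws small samples from \(\mathcal U(0,10)\), whose true variance is \((b-a)^2/12 = 8.33\), and compares the uncorrected formula with R’s built-in `var()`, which already applies Bessel’s correction:

```r
set.seed(1)
sigma2 <- 10^2 / 12  # true variance of U(0, 10) is (b - a)^2 / 12 = 8.33

# Uncorrected (divide by n) vs corrected (var() divides by n - 1), n = 10:
biased    <- replicate(2000, { x <- runif(10, 0, 10); mean((x - mean(x))^2) })
corrected <- replicate(2000, var(runif(10, 0, 10)))

mean(biased)     # near (n-1)/n * sigma2 = 0.9 * 8.33 = 7.5 -- underestimates
mean(corrected)  # near sigma2 = 8.33 -- unbiased
```

With \(n = 10\) the uncorrected formula underestimates the true variance by about 10%, exactly the \(\frac{n-1}{n}\) factor.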

Note that when \(n\) is large, the difference between \(\frac 1n\) and \(\frac{1}{n-1}\) becomes negligible—the difference between the two formulae matters only for small samples. This is related to why we use the \(t\)-distribution for small samples: it accounts for the extra uncertainty introduced by estimating \(\sigma\) with \(s\), producing a bell curve with fatter tails than the normal. It is also why the \(t\)-distribution has a single parameter, the degrees of freedom, since each sample size produces a slightly different curve.

3 Sampling Bias

To recap: a random sample is one where each observation is selected randomly from the population, and has an equal probability of being selected.

Sampling bias occurs when the observations in a sample are selected in a non-random way (certain observations in the population have a higher/lower probability of being selected than others). It results in a non-representative sample.

3.1 Types of sampling bias

Below are some common types of sampling bias—

Self-Selection Bias—can occur if study participants self-select (i.e. have control over whether to participate). It could be that the people who voluntarily opt in tend to represent a certain group in the population (e.g. with particularly strong opinions/characteristics), resulting in a biased sample.

  • e.g. using responses from polls or voluntary surveys—respondents may tend to have stronger opinions than nonrespondents, resulting in an overrepresentation of “extreme” opinions.
  • also see—participation bias.

Referral bias (aka Berkson’s fallacy)—can occur if the study population is selected from a certain environment that differs from the general population/control group.

  • e.g. in hospital studies, if the admissions rates to the hospital are different for certain ailments (e.g. admission rates of exposed cases and controls differ), the association between exposure and ailment can be distorted. This can result in spurious negative correlations between ailments. Example.
  • also see—admission rate bias.

Survivorship Bias—can occur if the study only focuses on observations that “survived” after some selection criteria. Ignoring/excluding “failures” can result in optimistic beliefs about the characteristics associated with “successes”.

  • e.g. WWII planes—during the war the US military thought it could reduce aircraft casualties by adding extra armour to its fighter planes where the returning planes showed most damage. But in doing so it only considered planes that survived—the planes that were shot down were excluded from the damage analysis. This is a classic example—read more here.

Response Bias—the tendency for participants to give misleading responses due to behavioral/environmental inputs. Read more here.

3.2 Sampling methods

Below are some common methods for sampling from a population—

Simple Random Sampling—a method whereby each observation in the population has an equal probability of being selected. If carried out properly, this method should minimize bias and ensure a fairly representative sample. An issue with SRS is its susceptibility to random sampling error, which may accidentally result in a biased sample—e.g. if a population has 50% women and 50% men, random sampling will on average produce representative proportions of each gender, but in any one sample the proportions may be slightly off due to variability. The smaller the sample, the more susceptible it is to this problem.

Stratified Sampling—a method whereby the population is divided into categories or “strata”, and each stratum is then sampled individually using a second sampling method (usually SRS). This method is useful if the population comprises distinct groups (e.g. people of different races) and it’s of particular importance that each group is fairly represented (e.g. if the group is correlated with some effect/response). Stratified sampling will ensure each group is represented in the sample, where simple random sampling may over/underrepresent certain groups due to chance variability. Note that stratified sampling is only advantageous over SRS if the population can be divided into distinct groups that are relatively homogeneous (i.e. the groups should have lower variability than the population as a whole). If so, stratified sampling will produce smaller errors in estimation. Check out this link for more on errors and CIs in stratified sampling.
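The variance advantage of stratified sampling can be sketched in a simulation. The population below is hypothetical: two equally-sized strata that are internally homogeneous (small within-stratum variability) but have different means, so most of the population's variability is between strata:

```r
# Hypothetical population: two homogeneous strata with different means.
set.seed(1)
pop <- data.frame(
  stratum = rep(c("A", "B"), each = 5000),
  y = c(rnorm(5000, mean = 10, sd = 1), rnorm(5000, mean = 20, sd = 1))
)

# SRS: stratum proportions in each sample fluctuate by chance.
srs_est <- replicate(2000, mean(sample(pop$y, 50)))

# Stratified: draw exactly 25 from each stratum (equal strata, so the
# unweighted mean of the combined draw estimates the population mean).
strat_est <- replicate(2000, {
  a <- sample(pop$y[pop$stratum == "A"], 25)
  b <- sample(pop$y[pop$stratum == "B"], 25)
  mean(c(a, b))
})

sd(srs_est)    # larger -- chance imbalance between strata adds variability
sd(strat_est)  # smaller -- each stratum is represented exactly
```

Both estimators are unbiased here; stratification simply removes the between-strata component of the sampling variability.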

3.3 Correcting for sample bias

If entire groups in a population are excluded from a sample, there is no way to correct the bias in subsequent estimates. However if the sample underrepresents certain groups, and the true population proportions are known (or can be guessed), then the bias can be corrected by weighting each group.

E.g. if a population is known to have 50% females and 50% males, but a sample has 60 females and 40 males, the bias could be corrected by weighting each female observation by \(\frac{50}{60}=0.833\) and each male observation by \(\frac{50}{40}=1.25\). This correction would make subsequent estimates have the same expected value as they would in a representative sample. Note that weighted sampling requires prior knowledge of the population proportions, and it doesn’t account for the possibility that females and males might have differed in their likelihood of being selected (this could be a problem in self-selected or environment-dependent samples).
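The weighting calculation above can be sketched directly (the group means here are hypothetical: females averaging 60, males 70, so the true 50/50 population mean is 65):

```r
# Hypothetical sample: 60 females (overrepresented), 40 males (underrepresented).
set.seed(1)
females <- rnorm(60, mean = 60, sd = 5)
males   <- rnorm(40, mean = 70, sd = 5)

unweighted <- mean(c(females, males))  # pulled toward the female mean (~64)

# Weight each observation to restore the known 50/50 population split:
w <- c(rep(50/60, 60), rep(50/40, 40))
corrected <- weighted.mean(c(females, males), w)  # near the true mean of 65

unweighted
corrected
```

The weighted estimate has the same expected value as an estimate from a representative 50/50 sample.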

4 Some Consequences of Bias

If \(X\) is a RV for the observed outcomes in a random unbiased sample, then we should expect that \(\text{E}[X] = \mu\), and that the larger the sample, the better the approximation (the LLN).

But if the sampling method is biased, and certain segments of the population are excluded/underrepresented, the expected value of \(X\) is no longer the population mean, but rather the mean of the biased subset, i.e. \(\text{E}[X] = c\) where \(c \neq \mu\). In this case increasing \(n\) will make \(\bar X\) converge to \(c\), not \(\mu\).
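A sketch of this convergence to the wrong value: suppose the population is \(\mathcal U(0,10)\) with \(\mu = 5\), but the sampling method can only observe values above 4 (a hypothetical exclusion), so the sampled subpopulation has \(\text{E}[X] = 7\):

```r
# Biased sampling: only the part of the population above 4 is observable,
# so E[X] = 7 for the sampled subpopulation (uniform on (4, 10)).
set.seed(1)
biased_sample <- function(n) {
  x <- runif(10 * n, min = 0, max = 10)  # oversample, then truncate
  head(x[x > 4], n)
}

mean(biased_sample(100))     # near 7, not mu = 5
mean(biased_sample(100000))  # increasing n converges to c = 7, not to mu
```

Increasing the sample size only makes the estimate converge more tightly to the biased value \(c = 7\).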

4.1 The LLN and CLT

In general, parameter estimates will converge to the expected value of the (sub)population represented in the sample.

The general statement of the law of large numbers:

\[\bar X \longrightarrow \text{E}[X]\]

i.e. when \(n\) is large, the sample mean will converge to its expected value. Convergence to the true mean \(\mu\) requires that \(\text{E}[X]=\mu\), which is only possible if the sampling method is unbiased and the sample is representative.

Similarly, the CLT can be stated generally:

\[\bar X \sim \mathcal N \bigg( \text{E}[X], \frac{\text{Var}[X]}{n} \bigg)\]

4.2 Effects on intervals

Since confidence intervals are centered on the sample mean, \(\bar X\), the probability that a given interval contains the true mean depends on how close \(\bar X\) is to \(\mu\). If the sample is biased, confidence intervals will be centered somewhere other than \(\mu\), and the probability of capturing the true mean will converge to zero as \(n\) increases.

E.g. suppose a population has true mean \(\mu = 10\). If our sample is biased, such that \(\text{E}[X] = 12 \neq \mu\), a 95% confidence interval computed on \(\bar X\) will be centered at 12, not 10. With \(s = 6\) and \(n=20\), a 95% confidence interval will asymptotically cover the following region:

i.e. with \(n=20\) it contains \(\mu\), but only just.

If the sample had \(n=60\), the 95% confidence interval no longer contains \(\mu\):

If the sample had \(n=200\):

The takeaway—while increasing \(n\) reduces the variability of an estimate, if the sample is biased it will also decrease the probability that the interval contains the true value. As \(n\) gets large the “coverage” probability of the interval will converge to zero—eventually none of the intervals will contain the true parameter, even if some initially did.
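The coverage collapse can be sketched by simulation, using the numbers from the example above (\(\mu = 10\), a biased sample with \(\text{E}[X] = 12\) and \(\text{sd} = 6\)):

```r
# Coverage probability of a 95% t-interval when the sample is biased:
# intervals are centered near E[X] = 12, but the true mean is mu = 10.
set.seed(1)
coverage <- function(n, reps = 2000) {
  hits <- replicate(reps, {
    x  <- rnorm(n, mean = 12, sd = 6)
    ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
    ci[1] <= 10 && 10 <= ci[2]
  })
  mean(hits)
}

coverage(20)   # wide intervals -- many still reach back to mu = 10
coverage(200)  # narrow intervals -- coverage has collapsed toward zero
```

As \(n\) grows, the intervals shrink around the biased center and the coverage probability heads to zero.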

The theme of trading between bias and variance is an important one in statistics—often it’s not possible to minimize both simultaneously.

4.3 Effects on power

Recall the power of a test is the probability of correctly rejecting a false null.

E.g. if a null hypothesis has \(H_0: \mu = 12\) but the true mean is \(\mu = 9\), the power of the test can be visualized as the following region:

But if the sample is biased, such that \(\text{E}[X] = 10.5\), the power of the test is now:

The green line represents the distribution of the biased sample.

Here the bias brings the sample mean closer to the null, which reduces the power of the test.

Conversely, if the sample bias is such that \(\text{E}[X] = 7.5\), the power of the test is:

Here the bias brings the sample mean further away from the null, which increases the power of the test.

The takeaway—a biased sample can increase or decrease the power of the test, depending on whether the bias brings the sample mean closer/further from the true mean.
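This effect can be sketched numerically for a one-sided \(z\)-test of \(H_0: \mu = 12\) at \(\alpha = 0.05\) (the sd of 6 and sample size of 36 below are hypothetical values chosen for illustration):

```r
# Power of a lower-tailed z-test of H0: mu = 12, for a given E[X].
power_vs_null <- function(mean_x, mu0 = 12, sd = 6, n = 36, alpha = 0.05) {
  crit <- mu0 - qnorm(1 - alpha) * sd / sqrt(n)  # reject when xbar < crit
  pnorm(crit, mean = mean_x, sd = sd / sqrt(n))  # P(reject | E[X] = mean_x)
}

power_vs_null(9)     # unbiased sample: E[X] = true mean = 9
power_vs_null(10.5)  # bias toward the null: power drops
power_vs_null(7.5)   # bias away from the null: power rises
```

The same test gains or loses power purely as a function of where the bias shifts the sampling distribution relative to the null.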

4.4 Effects on Type I Error

If the null hypothesis is true, a biased sample can make us lose control of the type I error.

E.g. if a null hypothesis has \(H_0: \mu = 12\), and if the null is true, then under an unbiased sample the type I error is:

But if the sample is biased, such that \(\text{E}[X] = 14\), the type I error is:

If the test is two-tailed, then bias in the other direction will also increase the type I error:
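These effects on the type I error can be sketched numerically for a two-tailed \(z\)-test of \(H_0: \mu = 12\) at \(\alpha = 0.05\) (sd of 6 and sample size of 36 are hypothetical values, as above):

```r
# Rejection rate of a two-tailed z-test of H0: mu = 12 when H0 is true
# but the sampling distribution is centered at E[X] = mean_x.
type1_error <- function(mean_x, mu0 = 12, sd = 6, n = 36, alpha = 0.05) {
  se <- sd / sqrt(n)
  lo <- mu0 - qnorm(1 - alpha / 2) * se
  hi <- mu0 + qnorm(1 - alpha / 2) * se
  pnorm(lo, mean_x, se) + (1 - pnorm(hi, mean_x, se))  # P(reject | H0 true)
}

type1_error(12)  # unbiased sample: rejection rate equals alpha = 0.05
type1_error(14)  # biased upward: rejection rate far exceeds alpha
type1_error(10)  # bias in the other direction also inflates it
```

Either direction of bias shifts the sampling distribution into a rejection region, so the test rejects a true null far more often than the nominal \(\alpha\).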