Hypothesis Tests

1 Hypothesis Tests

A statistical hypothesis is an assumption about one or more population parameters. The goal is to use data to determine whether a given hypothesis should be retained or rejected.

To test a hypothesis you compare two competing models for the population parameter(s). The null hypothesis is the proposed model—this is the assumption you want to test. The null can be based on anything—the results of a previous experiment, a status quo, a hunch, etc.

To test the null you compare it to a competing model for the population parameter(s), usually in the form of a (new) sample of data. If the competing model yields vastly different results to those predicted by the null, you have evidence to reject the null. The alternative hypothesis describes the scenario under which the null is not true.

There are two possible outcomes from a hypothesis test: either you reject the null or you fail to reject it.

Note the alternative hypothesis is always stated as a negation of the null. Hypothesis tests give a framework for rejecting a given model, but not for uniquely specifying one.

1.1 Example

Suppose you develop a hypothesis, using the pay gap data, that the true mean hourly wage gap is 12.36%. The null and alternative hypotheses are:

\[H_0: \mu = 12.36 \nonumber\] \[H_1: \mu \neq 12.36 \nonumber\]

Suppose that the following year you collect a new sample of data, which for the same variable yields a sample mean \(\bar X = 9.42\). We can now use this new evidence to test the initial hypothesis and determine whether it should be rejected.

First you must quantify the probability of getting the observed value if the null hypothesis were really true. If this probability is sufficiently low, you have evidence against the null hypothesis.

The significance level (\(\alpha\)) of a test is the probability threshold below which you reject the null. The most common significance level is \(\alpha = 0.05\), which means you reject the null if the observed value lies in the most extreme 5% of values under the null distribution. A stricter level of \(\alpha = 0.01\) is also sometimes used.

The rejection region of a test is the range of values for which you reject the null. The size of the rejection region is determined by the significance level you use. Below are two plots of the distribution of \(\bar X\) under the null hypothesis (as defined above), with rejection regions for \(\alpha = 0.05\) and \(\alpha = 0.01\):

The total area occupied by the rejection region is simply the significance level. The critical values are the bounds of the rejection region:

\[\bigg\{ \mu - c \cdot \frac{s}{\sqrt n} \;\; , \;\; \mu + c \cdot \frac{s}{\sqrt n} \bigg\}\]

where \(\mu\) is the mean under the null distribution. In this case the null distribution has \(\mu = 12.36\), \(s = 16.01\), and \(n = 153\). If you conduct the test at the \(\alpha = 0.05\) level, the critical values are the 2.5th and 97.5th percentiles of the distribution, so \(c = t_{\{(1-\alpha/2),df=152\}} = 1.976\). Using these values the rejection region is:

\[\bigg\{ 12.36 - 1.976 \cdot \frac{16.01}{\sqrt{153}} \;\; , \;\; 12.36 + 1.976 \cdot \frac{16.01}{\sqrt{153}} \bigg\}\] \[\Longrightarrow \;\; \{ 9.8, 14.9 \}\]

i.e. the rejection region in this test is \(\bar X < 9.799\) or \(\bar X > 14.913\). Note the critical values for an \(\alpha=0.05\) test are also just the bounds of a 95% confidence interval.
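
As a quick check, below is a minimal R sketch of this calculation, assuming the summary statistics quoted above (the variable names are just for illustration):

mu0   <- 12.36    # mean under the null
s     <- 16.01    # sample standard deviation
n     <- 153      # sample size
alpha <- 0.05
c_val <- qt(1 - alpha/2, df = n - 1)    # critical value, ~1.976

# bounds of the rejection region
c(mu0 - c_val * s / sqrt(n), mu0 + c_val * s / sqrt(n))   # ~{9.8, 14.9}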

A test is statistically significant if the observed value falls in the rejection region. In this example, where the observed value is \(\bar X = 9.42\) and \(\alpha = 0.05\):

i.e. the observed value clearly falls in the rejection region. You can thus reject the null at the 5% significance level, and conclude that the true mean hourly wage gap is not 12.36%.

Note if you had used a significance level of \(\alpha = 0.01\), you would have reached a different conclusion:

i.e. with \(\alpha = 0.01\), the observed value does not fall in the rejection region, and thus you cannot reject the null. The choice of significance level can determine the fate of a hypothesis.

1.2 The p-value of a test

Another way to conduct a hypothesis test is by looking at \(p\)-values.

The p-value of a test is the probability of getting a result at least as extreme as the observed value, under the null hypothesis.

In this example the observed value is \(\bar X = 9.42\). The probability of getting a value at least as extreme as \(\bar X = 9.42\) is the following region:

Each region has a probability of 0.0116, which means the total probability of getting a value at least as extreme as the observed value is 0.0233. Thus the \(p\)-value of this test is 0.0233.

The test is statistically significant if the \(p\)-value is smaller than the significance level of the test. If you had used \(\alpha = 0.05\), the result would have been statistically significant, and you would have rejected the null; but if you used \(\alpha = 0.01\), you couldn’t have rejected the null.

Note this method is equivalent to the previous one (i.e. computing the rejection region and seeing whether it contains the observed value).
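
As a quick check, below is a minimal R sketch of the p-value calculation, using the same summary statistics (illustrative only):

mu0  <- 12.36   # mean under the null
s    <- 16.01
n    <- 153
xbar <- 9.42    # observed sample mean

t_stat <- (xbar - mu0) / (s / sqrt(n))   # standardized distance from the null mean
2 * pt(-abs(t_stat), df = n - 1)         # two-sided p-value, roughly 0.02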

1.3 Testing workflow

To summarize, below are the basic steps to follow when conducting a hypothesis test:

  • state the null and alternative hypotheses. The null is the assumption you’re testing—it’s a proposed model for one or more population parameters. The alternative hypothesis describes the scenario where the null is not true.
  • choose a significance level, \(\alpha\), for the test—this is the probability threshold below which you will definitively reject the null. Common levels are \(\alpha = 0.05\) and \(\alpha = 0.01\).

Then, either:

  • determine the rejection region of the test—a range of values that would cause you to reject the null if the observed value were to lie in this range. The bounds of the rejection region are determined by the significance level.
  • determine whether to reject the null hypothesis based on whether the observed value lies in the rejection region or not

or:

  • compute the \(p\)-value of the test—the probability of getting a value at least as extreme as the observed value
  • reject the null if the \(p\)-value is smaller than the significance level

2 Errors and Power

In a hypothesis test, the null is assumed until there is strong evidence to suggest you should reject it. Of course, the outcome of a single hypothesis test doesn’t necessarily lead to the correct result—the \(p\)-value (or whichever criterion you use) only represents a probability, and there’s always a chance the conclusion you draw from a given test is the wrong one, even if that chance is very low.

2.1 Type I and Type II Errors

There are two kinds of error we can make in a hypothesis test:

  • Type I Error: rejecting the null, when the null is actually true. The probability of type I error is no more than the significance level of a test.
  • Type II Error: failing to reject the null, when the null actually is false.
                      retain null        reject null
\(H_0\) true          \(\checkmark\)     type I error
\(H_0\) false         type II error      \(\checkmark\)

The foremost goal in hypothesis testing is ensuring the chance of type I error is low (since most hypothesis tests are out to disprove something, it would be particularly bad to reject a null that was actually true). Since the probability of type I error is determined by the significance level, you must choose a significance level that is appropriate for the context—e.g. if type I errors are particularly dangerous, you should use a small significance level (e.g. \(\alpha = 0.01\) or \(\alpha = 0.005\)), to minimize your chance of rejecting the null when it’s actually true.

The two plots below should help you visualize the type I and type II error. Using the same example as above, the left figure shows the type I error if the true mean wage gap is really 12.36% (i.e. \(H_0\) true). The right figure shows the type II error if the true mean wage gap is really 8% (i.e. \(H_0\) false), with true distribution overlaid in pink:

Note that the probability of a type I error is simply the area of the rejection region of the test (\(\alpha\)). From the figure above you should be able to see why reducing the type I error must also increase the type II error.

2.2 The power of a test

Another useful concept in hypothesis testing is power—this is the probability of correctly rejecting a false null. Note that power is the complement of the type II error probability, i.e. Power = 1 − P(type II error). The secondary goal in hypothesis testing, after controlling the type I error, is to choose the test with maximum power—typically you want power to be greater than 0.8 (i.e. the probability of type II error to be smaller than 0.2).

Using the same example as above, the power of the test can be visualized as the following region:

The further away the true mean is from the proposed mean, the higher the power of the test. To see this mathematically, consider the power function, \(\beta\), which describes the power of a test as a function of the true parameter \(\mu\), i.e. \(\beta = \beta(\mu)\). Since power is the probability of correctly rejecting the null, the power function for a two-tailed test is:

\[\beta(\mu) = \text{P}(t > c) + \text{P}(t < -c)\]

where \(c = t_{\{(1-\alpha/2),df=n-1 \}}\). In full, this can be expressed:

\[ \begin{aligned} \beta(\mu) &= \text{P}(t > c) + \text{P}(t < -c)\\ &= \text{P}\bigg( \frac{(\bar X - \mu_{H_0})}{s/\sqrt n} > c \bigg) + \text{P}\bigg( \frac{(\bar X - \mu_{H_0})}{s/\sqrt n} < -c \bigg)\\ &= 1- \text{P}\bigg( \frac{(\bar X - \mu_{H_0})}{s/\sqrt n} < c \bigg) + \text{P}\bigg( \frac{(\bar X - \mu_{H_0})}{s/\sqrt n} < -c \bigg)\\ &= 1- \text{P}\bigg( \frac{(\bar X - \mu_{H_0}) - \mu}{s/\sqrt n} < c - \frac{\mu}{s/\sqrt n} \bigg) + \text{P}\bigg( \frac{(\bar X - \mu_{H_0}) - \mu}{s/\sqrt n} < -c - \frac{\mu}{s/\sqrt n} \bigg) \\ &= 1- \text{P}\bigg( \frac{\bar X - \mu}{s/\sqrt n} < c - \frac{\mu - \mu_{H_0}}{s/\sqrt n} \bigg) + \text{P}\bigg( \frac{\bar X - \mu}{s/\sqrt n} < -c - \frac{\mu - \mu_{H_0}}{s/\sqrt n} \bigg) \\ &= 1- \Phi \bigg( c - \frac{\sqrt n (\mu - \mu_{H_0})}{s} \bigg) + \Phi \bigg( -c - \frac{\sqrt n (\mu - \mu_{H_0})}{s} \bigg) \end{aligned} \]

where \(\Phi\) is the cdf of the normal distribution (or the \(t\)-distribution), \(\mu\) is the true mean, \(\mu_{H_0}\) is the mean under the null, and \(c\) is the \(Z_{(1-\alpha/2)}\) or \(t_{\{(1-\alpha/2),df=n-1 \}}\) value. Below is a plot of the power as a function of the true mean, for the above example with \(\mu_{H_0} = 12.36\):
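
Below is a minimal R sketch of this power function, assuming the same summary statistics as above (\(\mu_{H_0} = 12.36\), \(s = 16.01\), \(n = 153\)); the function and variable names are just for illustration:

mu0   <- 12.36
s     <- 16.01
n     <- 153
alpha <- 0.05
c_val <- qt(1 - alpha/2, df = n - 1)

# power as a function of the true mean, using the t cdf in place of Phi
power_fn <- function(mu) {
  shift <- sqrt(n) * (mu - mu0) / s
  1 - pt(c_val - shift, df = n - 1) + pt(-c_val - shift, df = n - 1)
}

power_fn(12.36)   # equals the size of the test, ~0.05
power_fn(8)       # power if the true mean is really 8%, ~0.92
curve(power_fn(x), from = 6, to = 19, xlab = "true mean", ylab = "power")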

You can also introduce another concept: the size of a test, which is the largest probability of rejecting \(H_0\) when \(H_0\) is true. This is also known as the false rejection rate. The size of a test is the value of the power function when the true mean is equal to the null mean, which in this case is \(\beta(\mu = \mu_{H_0}) = \beta(12.36) = 0.05\).

3 Effect Size and Practical Significance

There is a major issue in hypothesis testing that arises when using large samples. In a two-tailed test the bounds of the rejection region are given by \(\mu \pm c \cdot \frac{s}{\sqrt n}\), where \(\mu\) is the value of the true mean under the null. Note how the bounds have an inverse dependency on \(n\): as \(n\) becomes larger, the distance between the null value and the rejection region becomes smaller. This means that for very large \(n\), you will reject the null for very small deviations of the observed value from the null—even if the deviations are so small that they have no practical significance.

To illustrate this, suppose you are testing the average height in a population, and the null hypothesis is \(\mu = 178\) cm. Suppose you have sample data with \(\bar X = 179\) cm, \(s = 23.5\) cm, and \(n = 100\). If you were to conduct the hypothesis test that \(\mu \neq 178\) cm at the \(\alpha = 0.05\) level, with \(c = Z_{(1-\alpha/2)} = 1.96\), you would have a rejection region as follows:

\[\Bigg\{ 178 - 1.96 \cdot \frac{23.5}{\sqrt{100}} \;\; , \;\; 178 + 1.96 \cdot \frac{23.5}{\sqrt{100}} \Bigg\}\] \[\Longrightarrow \;\;\; \{ 173.39 \; \text{cm},182.61 \; \text{cm} \}\]

i.e. you would reject the null if you observed a value more extreme than 173.39 cm or 182.61 cm. Since the observed value is 179 cm, you do not reject the null, which is an appropriate result since the difference between 178 cm and 179 cm is trivial in the context of people’s heights. But if you do the same test with a sample size \(n= 10,000\), the rejection region is:

\[\Bigg\{ 178 - 1.96 \cdot \frac{23.5}{\sqrt{10000}} \;\; , \;\; 178 + 1.96 \cdot \frac{23.5}{\sqrt{10000}} \Bigg\}\]

\[\Longrightarrow \;\;\; \{ 177.54 \; \text{cm},178.46 \; \text{cm} \}\]

i.e. with \(n = 10,000\) you would reject the observed value of 179 cm, even though in reality such a small difference doesn’t really warrant concluding that the true mean height is not 178 cm. This is the issue with large-sample hypothesis tests: results that are not practically significant can still be considered statistically significant, and the larger the sample size, the more this is going to happen.

One way to control this issue when conducting a study is to plan an appropriate sample size—i.e. to choose a sample size that will result in practically significant rejections only. First, you must decide what constitutes a practically significant result—e.g. in the above example, suppose you decide that a deviation of at least \(\pm 5\) cm from the null mean should permit you to reject the null (i.e. reject if \(\bar X < 173\) cm or \(\bar X > 183\) cm). What is the sample size necessary to make rejections of this magnitude only, at the \(\alpha = 0.05\) level? Simply set the bounds of the rejection region equal to the desired values, and solve for \(n\), i.e. set \(\mu + c \cdot \tfrac{s}{\sqrt n} = 183\), which you will find gives \(n \approx 85\).

It’s also convenient to introduce a quantity called the effect size, \(d\), defined as follows:

\[d = \frac{\mu_1 - \mu_2}{s}\]

where \(\mu_1 - \mu_2\) is the mean difference (i.e. the desired deviation from the null mean, or effect). This definition of effect size is known as Cohen’s \(d\). E.g.—in the above example the effect size is \(d = \frac{5}{23.5} = 0.213\). To determine the appropriate sample size for a study, you must first determine the smallest effect size that would be considered practically significant, then solve for \(n\). It can be shown that \(n = \big( \tfrac cd \big)^2\).
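
A minimal R sketch of this sample-size calculation, using the height example above (a practically significant deviation of 5 cm, \(s = 23.5\), \(\alpha = 0.05\)):

s     <- 23.5   # sample standard deviation
delta <- 5      # smallest practically significant deviation (cm)
alpha <- 0.05
c_val <- qnorm(1 - alpha/2)   # ~1.96

d <- delta / s           # Cohen's d, ~0.213
ceiling((c_val / d)^2)   # required sample size, ~85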

4 One-Tailed Hypothesis Tests

So far we have demonstrated only two-tailed hypothesis tests, since we assumed the true value of the parameter could be either above or below the proposed value under the null. This is why, when constructing the rejection region, we split the area equally between the lowest and highest extremes of the distribution. Most tests are two-tailed, since generally you don’t know the direction of the true value relative to the proposed value.

But there are instances when you know (or assume) that the true value could only be higher (or lower) than the proposed value. In these contexts it may be more appropriate to consider only the upper (or lower) tail of the distribution when conducting the test.

Example: suppose someone tells you the temperature of a room is 0 Kelvin (i.e. absolute zero; there is no possible temperature below this value), and you want to test this claim. In this case you can specify a directionality for the alternative hypothesis, since the true mean temperature cannot be below the proposed value:

\[H_0: \mu = 0 \nonumber\] \[H_1: \mu > 0 \nonumber\]

To conduct this one-tailed test at the 5% significance level, the rejection region comprises only the uppermost extreme 5% of values under the distribution:

i.e. with \(\alpha = 0.05\) the critical value for a one-tailed test is the 95th percentile of the distribution (in a two-tailed test with \(\alpha = 0.05\), the critical values are the 2.5th and 97.5th percentiles).
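
In R, a one-tailed test is requested with the alternative argument of t.test(). Below is a minimal sketch using a hypothetical vector of temperature readings (the values are made up for illustration):

temps <- c(291.2, 292.8, 290.5, 293.1, 291.9, 292.4)   # hypothetical readings, in kelvin

# one-tailed test: H0: mu = 0 vs H1: mu > 0
t.test(temps, mu = 0, alternative = "greater")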

5 The t-test

To recap: the \(t\)-distribution is used to approximate the limiting behavior of a sample mean when the population variance is unknown and must be estimated from the sample. The \(t\)-distribution has one parameter, degrees of freedom, which describes the exact shape of the bell curve. For a sample of size \(n\), the random variable \(\frac{\bar X - \mu}{s/\sqrt n}\) follows a \(t\)-distribution with \(n-1\) degrees of freedom. For more on this, see: \(t\)-distribution.

Below are some common examples of hypothesis tests conducted using the \(t\)-distribution.

5.1 One-sample t-test

A one-sample \(t\)-test can be used to test whether the mean of a population is equal to some specified value (the example demonstrated in section 1.1 was a one-sample \(t\)-test). For a two-sided test, the null and alternative hypotheses are:

\[H_0: \mu = k \nonumber\] \[H_1: \mu \neq k \nonumber\]

For data with sample mean \(\bar X\), sample s.d. \(s\), and sample size \(n\), the \(t\)-statistic for this test is:

\[t = \frac{\bar X - k}{\frac{s}{\sqrt n}}\]

For a two-sided test of size \(\alpha\), the bounds of the rejection region are:

\[\bigg\{ k - t_{(1-\alpha/2)} \cdot \frac{s}{\sqrt n} \; , \; k + t_{(1-\alpha/2)} \cdot \frac{s}{\sqrt n} \bigg\}\]

To perform the test you can either compute these bounds manually, or you can use the t.test() function, passing the vector of sample data as the first argument and the proposed null value of the mean as the mu argument.

E.g. for the pay gap data, testing whether the true mean hourly wage gap is zero:

t.test(paygap$DiffMeanHourlyPercent, mu = 0)
## 
##  One Sample t-test
## 
## data:  paygap$DiffMeanHourlyPercent
## t = 9.5468, df = 152, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   9.799123 14.913295
## sample estimates:
## mean of x 
##  12.35621

5.2 Welch two-sample t-test

The two-sample \(t\)-test can be used to test whether two groups have the same mean. The easiest way to do this is to think of the two groups as independent RVs, \(X\) and \(Y\), and to create a new RV for their difference. For a two-sided test, the null and alternative hypotheses are:

\[H_0: \mu_X - \mu_Y = 0 \nonumber\] \[H_1: \mu_X - \mu_Y \neq 0 \nonumber\]

where \(\mu_X\) denotes the true mean of group \(X\) and \(\mu_Y\) denotes the true mean of group \(Y\).

The \(t\)-statistic for the test is:

\[t = \frac{\bar X - \bar Y}{\sqrt{\frac{s_X^2}{n_X}+\frac{s_Y^2}{n_Y}}}\]

where the subscripts \(X\) and \(Y\) denote the sample parameters for each of the two groups. In this case the distribution of the \(t\)-statistic follows a \(t\)-distribution with degrees of freedom as follows:

\[\text{DoF} = \frac{\bigg( \frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y} \bigg)^2}{\frac{(s_X^2 / n_X)^2}{n_X-1}+\frac{(s_Y^2 / n_Y)^2}{n_Y-1}}\]

which is known as the Welch-Satterthwaite equation. For simplicity, if doing the computations by hand, you can use the smaller of \(n_X-1\) and \(n_Y-1\) as the DoF.
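
For reference, below is a minimal R sketch of these calculations using hypothetical group summaries (the numbers are made up for illustration):

xbar <- 25.5; s_x <- 30.2; n_x <- 153   # hypothetical summaries for group X
ybar <- 26.0; s_y <- 31.1; n_y <- 155   # hypothetical summaries for group Y

se     <- sqrt(s_x^2/n_x + s_y^2/n_y)   # standard error of the difference
t_stat <- (xbar - ybar) / se            # Welch t-statistic

# Welch-Satterthwaite degrees of freedom
dof <- se^4 / ((s_x^2/n_x)^2 / (n_x - 1) + (s_y^2/n_y)^2 / (n_y - 1))

2 * pt(-abs(t_stat), df = dof)          # two-sided p-value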

In any case, the t.test() function in R will do these cumbersome calculations for you. E.g. testing whether FemaleBonusPercent and MaleBonusPercent have the same mean:

t.test(paygap$FemaleBonusPercent, paygap$MaleBonusPercent)
## 
##  Welch Two Sample t-test
## 
## data:  paygap$FemaleBonusPercent and paygap$MaleBonusPercent
## t = -0.13953, df = 303.99, p-value = 0.8891
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.084740  7.014151
## sample estimates:
## mean of x mean of y 
##  25.49542  26.03072

5.3 Two-sample t-test with pooled data

If the two groups under scrutiny have the same population variance (or you have good reason to believe that they do), the standard error of the difference in sample means can be better estimated using something called the pooled standard deviation:

\[s_p = \sqrt{\frac{s_X^2 (n_X - 1)+s_Y^2(n_Y-1)}{n_X+n_Y-2}}\]

The \(t\)-statistic for this test is:

\[t = \frac{\bar X - \bar Y}{s_p \cdot \sqrt{\frac{1}{n_X}+\frac{1}{n_Y}}}\]

where the \(t\)-statistic follows a \(t\)-distribution with \(n_X + n_Y - 2\) degrees of freedom.

Pooling the standard deviation gives an unbiased estimate of the common variance of the two groups, which gives a more accurate model overall. Note that if the two groups don’t have the same variance, the Welch two-sample test must be used instead.
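
In R, the pooled version is requested by setting var.equal = TRUE in t.test(). A minimal sketch with hypothetical data vectors (illustrative values only):

x <- c(12.1, 14.3, 11.8, 15.2, 13.7, 12.9)   # hypothetical sample from group X
y <- c(10.4, 11.9, 13.1, 10.8, 12.2, 11.5)   # hypothetical sample from group Y

# pooled two-sample t-test (assumes equal population variances)
t.test(x, y, var.equal = TRUE)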

5.4 Two-sample t-test with paired data

Two samples are said to be paired if both samples have observations on the same subjects (e.g. a group of patients before and after some form of treatment). In this case we can use the paired difference test, which has a \(t\)-statistic:

\[t = \frac{\bar X_D - \mu}{\frac{s_D}{\sqrt{n}}}\]

where the subscript \(D\) denotes the sample parameter for the difference in measurements for each subject—e.g. \(s_D\) is the s.d. of the difference in measurements for each subject between the two groups. For the \(i\)th subject, the difference would be something like \(D_i = X_{2i} - X_{1i}\). The \(t\)-statistic in this case follows a \(t\)-distribution with \(n-1\) degrees of freedom.

Using the paired \(t\)-test increases the overall power of the test (when compared to the unpaired Welch test), and so is beneficial in contexts where it’s applicable.

In R you can perform a paired two-sample \(t\) test by specifying the parameter paired = TRUE in the t.test() function.
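
A minimal sketch with hypothetical before/after measurements on the same subjects (illustrative values only):

before <- c(140, 132, 128, 150, 145, 138, 127, 133)   # hypothetical measurements before treatment
after  <- c(135, 130, 125, 146, 140, 137, 128, 129)   # measurements on the same subjects after treatment

# paired t-test; equivalent to a one-sample t-test on the differences
t.test(after, before, paired = TRUE)
t.test(after - before, mu = 0)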

6 Testing Groups: Pearson’s \(\chi^2\) Test

The chi-squared test is useful for testing whether groups of categorical data resemble each other. Consider the following example, which shows data on blood type and the observed incidence of a particular ailment:

                      O    A   AB    B
no ailment           55   50    7   24
advanced ailment      7    5    3   13
minimal ailment      26   32    8   17

Suppose you’re interested in finding whether the incidence of ailment is dependent on blood type. To test this, define the null hypothesis that the incidence of ailment is not dependent on blood type (this means the data in different categories resemble each other). The alternative hypothesis is that the incidence of ailment is dependent on blood type (the data in different categories differ).

To perform this test, for each cell in the data we compute the squared difference between the observed count and the expected count, divided by the expected count:

\[\frac{(\text{observed}-\text{expected})^2}{\text{expected}}\]

E.g. for the first cell in the table (the number of subjects with no ailment and blood type O), the observed count is 55 and the expected count is 48.45. To calculate the expected count, note that there are 88 subjects in total with blood type O, and 247 subjects in the dataset in total, meaning the proportion of subjects in the study with blood type O is \(\frac{88}{247} = 0.356\). Note also there are 136 subjects in total with no ailment. The expected count for subjects with no ailment and blood type O is therefore \(136 \cdot \frac{88}{247}=48.45\), giving:

\[\frac{(55-48.45)^2}{48.45} \approx 0.89\]

The chi-squared test-statistic is the sum of these values for each cell in the table:

\[\chi^2 = \sum_i^k \frac{(X_i - \text{E}[X_i])^2}{\text{E}[X_i]}\]

Which, in this case, is:

\[\Longrightarrow \;\; \chi^2 = 15.797\]
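
A minimal R sketch of this calculation, building the table of counts given above and computing the statistic by hand:

# observed counts (rows: ailment status; columns: blood types O, A, AB, B)
obs <- rbind(c(55, 50, 7, 24), c(7, 5, 3, 13), c(26, 32, 8, 17))

# expected counts under independence: (row total x column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

sum((obs - expected)^2 / expected)   # chi-squared statistic, ~15.8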

6.1 The \(\chi^2\) distribution

The chi-squared test statistic follows the chi-squared distribution, which describes the distribution of a sum of squares of several independent standard normal random variables. The chi-squared distribution has one parameter, degrees of freedom, calculated as:

\[\text{DoF} = (\text{number of rows - 1})(\text{number of cols - 1})\]

Below is a plot showing the chi-squared distribution for different DoFs:

6.2 The \(\chi^2\)-test

In the example above, DoF = \((3-1)(4-1) = 6\). The observed test statistic, \(\chi^2 = 15.797\), encloses the following region in a \(\chi^2\) distribution with 6 DoFs:

The shaded region is the \(p\)-value of the test, since this region contains values at least as extreme as the observed test statistic. As always, the size of the \(p\)-value will tell us whether to reject or retain the null, depending on the significance level of the test.

In R we can perform the test and find the \(p\)-value using the chisq.test() function, as follows:

## make contingency table 
ailment_study = as.table(rbind(c(55,50,7,24), c(7,5,3,13), c(26,32,8,17)))
dimnames(ailment_study) = list(' ' = c('no ailment','advanced ailment','minimal ailment'), ' ' = c('O','A','AB','B'))
ailment_study
##                    
##                     O  A AB  B
##   no ailment       55 50  7 24
##   advanced ailment  7  5  3 13
##   minimal ailment  26 32  8 17
## perform chi-squared test
chisq.test(ailment_study)
## 
##  Pearson's Chi-squared test
## 
## data:  ailment_study
## X-squared = 15.797, df = 6, p-value = 0.01489

The \(p\)-value of the test is 0.0149, which implies we can reject the null at the 5% level, and conclude that the incidence of ailment shows dependency on blood type.

Note that rather than testing all groups at once (as a single hypothesis) we can also test each group (advanced ailment and minimal ailment) separately against the control group for dependence on blood type:

## test advanced ailment group against control group
chisq.test(ailment_study[c(1,2), ])
## 
##  Pearson's Chi-squared test
## 
## data:  ailment_study[c(1, 2), ]
## X-squared = 13.645, df = 3, p-value = 0.00343
## test minimal ailment group against control group
chisq.test(ailment_study[c(1,3), ])
## 
##  Pearson's Chi-squared test
## 
## data:  ailment_study[c(1, 3), ]
## X-squared = 2.9415, df = 3, p-value = 0.4007

Note how each of these groups, when tested separately against the control group, yields more variable results: the advanced ailment group has a significant \(p\)-value of 0.00343, but the minimal ailment group does not have a significant \(p\)-value. Conducting separate tests for each group is an example of multiple testing—some pitfalls of doing this are discussed next.

7 Multiple Testing

Testing many hypotheses is known as multiple testing. For any one test, the chance of a false rejection (type I error) is \(\alpha\). But when conducting many tests simultaneously, the chance of at least one false rejection is much larger:

\[\text{P}(\text{at least 1 false rejection in }m \text{ tests}) = 1 - \text{P}(\text{no false rejections in }m \text{ tests}) = 1-(1-\alpha)^m\]

E.g. if you conducted 10 tests simultaneously at the 5% level, the chance of at least one false rejection is:

\[1-(0.95)^{10} = 0.401\]

which is clearly much larger than the chance of false rejection in a single test. This is the multiple testing problem. It becomes particularly problematic when conducting a very large number (e.g. thousands or millions) of hypothesis tests simultaneously. The probability of making at least one false rejection in a sequence of hypothesis tests is known as the familywise error rate, or FWER.
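
A minimal R sketch of how quickly the FWER grows with the number of independent tests at \(\alpha = 0.05\):

alpha <- 0.05
m <- c(1, 5, 10, 50, 100)     # number of simultaneous tests
round(1 - (1 - alpha)^m, 3)   # chance of at least one false rejection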

The easiest way to get around this problem is to do as few tests as necessary—this will ensure a small FWER. But in contexts where multiple testing is necessary, there are a number of methods for dealing with this problem. Below we will briefly discuss two: the Bonferroni correction and the Benjamini-Hochberg correction.

7.1 The Bonferroni correction

The Bonferroni correction gives a way to control the familywise error rate (FWER). It works as follows.

Given \(m\) tests:

\[H_{0i} \;\; \text{vs.} \;\; H_{1i} \;\;\;\; \text{for} \;\; i = 1,...,m\]

If \(p_1, ..., p_m\) denote the \(p\)-values for these \(m\) tests, then reject the null hypothesis \(H_{0i}\) if and only if:

\[p_i < \frac \alpha m\]

This condition enforces the familywise error rate to be \(\text{FWER} \leq \alpha\). To see this:

\[\text{FWER} = \text{P}\Bigg\{ \bigcup_{i=1}^m \bigg( p_i \leq \frac \alpha m \bigg) \Bigg\} \leq \sum_i^m \bigg\{ \text{P}\bigg( p_i \leq \frac \alpha m \bigg) \bigg\} = \sum_i^m \frac \alpha m = \alpha\]

In R, you can adjust a vector of \(p\)-values with the Bonferroni correction using p.adjust(p, method = 'bonferroni'), where p is a numeric vector of \(p\)-values.

One issue with the Bonferroni method is that it’s very conservative—it tries to make it unlikely that you will make even one false rejection. It’s often more reasonable to control the false discovery rate (FDR)—this is the basis of the BH correction (next).

7.2 The Benjamini-Hochberg correction

The BH correction gives a way to control the false discovery rate (FDR), which is defined as the expected proportion of rejections that are false rejections.

Given \(m\) tests, the BH method works as follows:

  • arrange the observed \(p\)-values \(p_1, ..., p_m\) in increasing order, and assign ranks to each value based on its order, i.e. \(p_{(1)}, p_{(2)}, ..., p_{(m)}\)
  • for each \(p\)-value, compute the BH critical value, defined as \(\frac im \alpha\), where \(\alpha\) is the desired false discovery rate (e.g. 5% or 10%), and \(i\) is the rank of the \(p\)-value
  • compare each \(p\)-value to its BH critical value, and find the largest \(p\)-value that is smaller than its critical value
  • reject the null hypothesis (i.e. declare discoveries) for this \(p\)-value and all smaller \(p\)-values

If this procedure is applied, then regardless of how many null hypotheses are true, the false discovery rate (FDR) will be:

\[\text{FDR} \leq \alpha\]

In R, you can adjust a vector of \(p\)-values with the BH correction by setting p.adjust(p, method = 'fdr').
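
A minimal sketch of both adjustments on a hypothetical vector of \(p\)-values (illustrative values only):

p <- c(0.001, 0.008, 0.039, 0.041, 0.27, 0.60)   # hypothetical p-values from 6 tests

p.adjust(p, method = 'bonferroni')   # adjusted p-values controlling the FWER
p.adjust(p, method = 'fdr')          # adjusted p-values controlling the FDR (BH)

# reject wherever the adjusted p-value falls below the chosen level, e.g. 0.05
p.adjust(p, method = 'fdr') < 0.05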