14 Joint Variability

1 Variance, Covariance

In previous chapters we introduced you to the variance as a measure of spread of a distribution:

\[\text{Var}[X] = \text{E}\big[ (X - \text{E}[X] )^2 \big]\]

i.e. the variance is the expected value of the squared deviations from the mean. The variance can also be written as a sum:

\[\sigma^2 = \frac 1n \sum_i^n (X_i - \bar X)^2 \hspace{1cm} s^2 = \frac{1}{n-1} \sum_i^n (X_i - \bar X)^2\]

where \(\sigma^2\) denotes the population variance and \(s^2\) denotes the sample variance.
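
To make the two denominators concrete, here is a minimal R sketch (made-up data) computing the sample variance by hand and comparing it with the built-in var(), which also uses the \(n-1\) denominator:

x = c(4, 8, 6, 5, 3, 7)   # a small made-up sample
n = length(x)

# sample variance from the formula (n - 1 denominator)
sum((x - mean(x))^2) / (n - 1)

# R's built-in var() gives the same value
var(x)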

Variance is useful for characterizing the spread of one-dimensional data (i.e. a single variable). When dealing with multidimensional data (i.e. many variables) we’re often interested in characterizing the dependency (i.e. association) between two variables, i.e. the tendency for variables to vary together. We already introduced you to correlation as a measure of the strength of the linear association between two variables. We’ll now introduce the concept of dependence more formally.

Covariance is a measure of the joint variability of two variables—it gives the extent to which two variables tend to vary in the same direction.

Conceptually, covariance is similar to correlation. If two variables have a positive association, covariance is positive; if they have a negative association, covariance is negative. The difference between covariance and correlation is in their units—this point will become clear in the next section.

Formally, the covariance of two variables \(X\) and \(Y\) is defined:

\[\text{Cov}[X,Y] = \text{E}\big[ (X - \text{E}[X]) (Y - \text{E}[Y] )\big]\]

Or, written as a sum:

\[\sigma_{XY} = \frac 1n \sum_i^n (X_i - \bar X)(Y_i - \bar Y) \hspace{1cm} s_{XY} = \frac{1}{n-1} \sum_i^n (X_i - \bar X)(Y_i - \bar Y)\]

where \(\sigma_{XY}\) denotes the population covariance and \(s_{XY}\) denotes the sample covariance.

In R you can compute the covariance between two variables using cov(). Below is a plot of the relationship between temperature and albedo from the NYC heatwave data:

The covariance of temperature and albedo is:

cov(nycheat$temperature, nycheat$albedo)
## [1] -18.99705
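
As a sanity check, we can reproduce this number directly from the sample covariance formula (assuming the nycheat data frame is loaded as above):

n = nrow(nycheat)
sum((nycheat$temperature - mean(nycheat$temperature)) *
      (nycheat$albedo - mean(nycheat$albedo))) / (n - 1)
# returns the same value as cov() above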

The units of covariance are simply the product of the units of the two variables.

With a matrix of numeric variables, we can also construct a covariance matrix showing the pairwise covariances between each of the variables:

vcov_table = cov(nycheat[ ,2:6])
kable(vcov_table)
                  temperature   vegetation         albedo  building_height    pop_density
temperature        25.2023494   -0.4213865    -18.9970471        5.0919835   5.168777e+03
vegetation         -0.4213865    0.0216491      0.3142614       -0.1574037  -1.820730e-02
albedo            -18.9970471    0.3142614     38.6188177       -3.6195019  -7.339874e+03
building_height     5.0919835   -0.1574037     -3.6195019       10.5018033  -1.664037e+03
pop_density      5168.7771526   -0.0182073  -7339.8740999    -1664.0374370   1.208001e+08

You may notice that the diagonal elements of this matrix appear to show the covariance of a variable with itself. As it turns out, the covariance of a variable with itself is simply its variance.

\[\text{Cov}[X,X] = \text{E}\big[ (X - \text{E}[X])(X - \text{E}[X]) \big] = \text{Var}[X]\]
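
You can check this directly in R with any numeric vector; a one-line sketch with simulated data:

set.seed(1)
x = rnorm(100)
c(cov(x, x), var(x))   # identical: the covariance of x with itself is its variance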

For this reason the covariance matrix is also known as the variance-covariance matrix. Formally, the variance-covariance matrix is denoted \(\boldsymbol \Sigma\). For the three RVs \(X\), \(Y\) and \(Z\):

\[ \boldsymbol \Sigma = \begin{bmatrix} \sigma^2_X & \sigma_{XY} & \sigma_{XZ} \\ \sigma_{YX} & \sigma^2_Y & \sigma_{YZ} \\ \sigma_{ZX} & \sigma_{ZY} & \sigma^2_Z \end{bmatrix} \]

Earlier in the course you may recall us saying that variance is additive for independent RVs, i.e.

\[\text{Var}[X+Y] = \text{Var}[X]+\text{Var}[Y] \hspace{1cm} (\text{if } X \text{ and } Y \text{ are independent})\]

We can now show why this is the case. Using the fact that \(\text{Var}[X] = \text{E}[X^2] - \text{E}[X]^2\), we can expand \(\text{Var}[X+Y]\) as follows:

\[ \begin{aligned} \text{Var}[X+Y] &= \text{E}\big[ (X+Y)^2 \big] - \text{E}[X+Y]^2 \\ &= \text{E}\big[ X^2 + 2XY + Y^2 \big] - \big( \text{E}[X] + \text{E}[Y] \big)^2 \\ &= \text{E}[X^2] + 2\text{E}[XY] + \text{E}[Y^2] - \text{E}[X]^2 - 2\text{E}[X]\text{E}[Y] - \text{E}[Y]^2 \\ &= \big( \text{E}[X^2] - \text{E}[X]^2 \big) + \big( \text{E}[Y^2] - \text{E}[Y]^2 \big) + 2 \big( \text{E}[XY] - \text{E}[X]\text{E}[Y] \big) \\ &= \text{Var}[X] + \text{Var}[Y] + 2\text{Cov}[X,Y] \end{aligned} \]

i.e. the variance of a sum of RVs has a third term that accounts for the dependency of the two variables. Naturally, if the variables are independent, they have zero covariance, and the expression for variance reduces to the simple two-term sum.
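
We can verify this identity numerically; the sample versions of variance and covariance satisfy the same identity exactly. A minimal sketch with simulated, deliberately correlated data:

set.seed(42)
x = rnorm(1000)
y = 0.5 * x + rnorm(1000)   # y depends on x, so the covariance is nonzero

# the two sides of the identity agree
var(x + y)
var(x) + var(y) + 2 * cov(x, y)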

2 Correlation

One issue with covariance is that its units are the product of the units of the two variables. This means that covariance can take any value, and its magnitude depends on the units the variables are measured in: variables measured in large values will have large covariances, and variables measured in small values will have small covariances. E.g. in the covariance matrix above, note that the covariance terms involving pop_density tend to be very large; this is because population density takes values in the millions, while most of the other variables take much smaller values.

This makes it hard to compare covariances between different sets of variables, since covariance is intrinsically a unit-dependent quantity.

Correlation solves this issue by dividing covariance by the standard deviations of each of the variables. Formally, the correlation between two variables \(X\) and \(Y\) is defined:

\[\text{Cor}[X,Y] = \frac{\text{E}\big[ (X - \text{E}[X]) (Y - \text{E}[Y] )\big]}{\sigma_X \sigma_Y}\]

Or, written as a sum:

\[\rho_{XY} = \frac 1n \frac{\sum_i^n (X_i - \bar X)(Y_i - \bar Y)}{\sigma_X \sigma_Y} \hspace{1cm} r_{XY} = \frac{1}{n-1} \frac{\sum_i^n (X_i - \bar X)(Y_i - \bar Y)}{s_X s_Y}\]

where \(\rho_{XY}\) denotes the population correlation and \(r_{XY}\) denotes the sample correlation. Note this particular definition of correlation is known as Pearson’s product-moment correlation coefficient, or PPMCC, as the cool kids say.

Correlation is a unitless quantity—it effectively normalizes covariance, reducing it to a quantity between -1 and 1, making it possible to compare the strength of the joint variability (dependence) between sets of two variables.

In R you can use cor() to compute the correlation coefficient between two variables:

cor(nycheat$temperature, nycheat$albedo)
## [1] -0.6089282
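
This is just the covariance from the previous section divided by the two standard deviations. A quick check (assuming the nycheat data frame is loaded as above):

# correlation = covariance divided by the product of the standard deviations
cov(nycheat$temperature, nycheat$albedo) /
  (sd(nycheat$temperature) * sd(nycheat$albedo))
# reproduces the value returned by cor() above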

Below is a correlation matrix for the variables in the NYC heatwave dataset:

cor_table = cor(nycheat[ ,2:6])
kable(cor_table)
                 temperature  vegetation      albedo  building_height  pop_density
temperature        1.0000000  -0.5704794  -0.6089282        0.3129929    0.0936772
vegetation        -0.5704794   1.0000000   0.3436937       -0.3301132   -0.0000113
albedo            -0.6089282   0.3436937   1.0000000       -0.1797287   -0.1074622
building_height    0.3129929  -0.3301132  -0.1797287        1.0000000   -0.0467194
pop_density        0.0936772  -0.0000113  -0.1074622       -0.0467194    1.0000000

Since correlation is unitless, we can directly compare the correlations between different pairs of variables: e.g. we can say the correlation between temperature and building_height is stronger than the correlation between temperature and pop_density. Note that we cannot make this kind of comparison with the covariance matrix.

Note that correlation (and covariance) is only sensitive to linear relationships between variables. Below are some examples of scatterplots showing different kinds of association between two variables:

The bottom row shows examples of nonlinear associations. Though these variables are clearly associated, they have zero correlation.
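
To see this numerically, here is a small simulated sketch: a variable that is a deterministic (quadratic) function of another, yet has essentially zero Pearson correlation with it:

set.seed(1)
x = runif(10000, -1, 1)   # symmetric around zero
y = x^2                   # y is completely determined by x

cor(x, y)                 # approximately zero despite the perfect (nonlinear) dependence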

3 Multivariate Normal

We introduced the normal distribution in chapter 9, as the continuous probability distribution with the density function:

\[\text{pdf}(X) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(X-\mu)^2/2\sigma^2}\]

Sometimes this is known as the univariate normal distribution, since it’s a function of one random variable, \(X\).
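
As a quick check of the formula, here is a sketch evaluating the density by hand at one point and comparing it with R's built-in dnorm(); the values of \(\mu\), \(\sigma\) and \(x\) below are arbitrary:

mu = 2
sigma = 1.5
x = 2.8

# density evaluated directly from the formula
exp(-(x - mu)^2 / (2 * sigma^2)) / sqrt(2 * pi * sigma^2)

# same value from R's built-in normal density
dnorm(x, mean = mu, sd = sigma)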

Often in statistics we must deal with a combination of random variables (e.g. in multiple regression). In these situations we can generalize the simple normal distribution to higher dimensions—this is known as the multivariate normal or joint normal distribution.

Consider a random vector \({\boldsymbol X} = (X_1, X_2, ..., X_k)\), which is a collection of \(k\) random variables. The random vector is multivariate normal if every linear combination of its components is normally distributed, i.e. if

\[a_1 X_1 + a_2 X_2 + ... + a_k X_k\]

has a normal distribution for every choice of constants \(a_1, a_2, ..., a_k\).
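
A simulated sketch of this property using mvrnorm() from the MASS package (assumed installed); the mean vector, covariance matrix and coefficients below are arbitrary choices for illustration:

library(MASS)   # for mvrnorm()

set.seed(1)
mu = c(2, 1, 0)
Sigma = matrix(c(2.0, 0.5, 0.3,
                 0.5, 1.0, 0.2,
                 0.3, 0.2, 1.5), nrow = 3, byrow = TRUE)

X = mvrnorm(n = 10000, mu = mu, Sigma = Sigma)   # 10000 draws of (X1, X2, X3)

# an arbitrary linear combination a1*X1 + a2*X2 + a3*X3
a = c(1, -2, 0.5)
lin_comb = X %*% a

# its sample mean and variance match a'mu and a'Sigma*a,
# and a QQ-plot against normal quantiles is close to a straight line
c(mean(lin_comb), t(a) %*% mu)
c(var(lin_comb), t(a) %*% Sigma %*% a)
qqnorm(lin_comb); qqline(lin_comb)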

The random vector \({\boldsymbol X}\) is said to have mean vector \({\boldsymbol \mu}\) and covariance matrix \({\boldsymbol \Sigma}\):

\[{\boldsymbol \mu }= \big[ \text{E}[X_1], \text{E}[X_2], ..., \text{E}[X_k] \big]\]  

\[ {\boldsymbol \Sigma }= \begin{bmatrix} \sigma^2_{X_1} & \sigma_{X_1,X_2} & ... & \sigma_{X_1,X_k} \\ \sigma_{X_2,X_1} & \sigma^2_{X_2} & ... & \sigma_{X_2,X_k} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{X_k, X_1} & \sigma_{X_k, X_2} & ... & \sigma^2_{X_k} \end{bmatrix} \]

Its joint probability density function is:

\[\frac{1}{\sqrt{(2\pi)^k | {\boldsymbol \Sigma }|}} \exp\big(-\tfrac 12 ({\boldsymbol X} - {\boldsymbol \mu})^T {\boldsymbol \Sigma}^{-1} ({\boldsymbol X} -{\boldsymbol \mu}) \big)\]

This can be expressed using the following shorthand:

\[{\boldsymbol X} \sim \mathcal N \big( {\boldsymbol \mu}, {\boldsymbol \Sigma }\big)\]
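
To tie the pieces together, here is a sketch evaluating this joint density by hand at a single point and comparing it with dmvnorm() from the mvtnorm package (assumed installed); the mean vector and covariance matrix are those of the first example below, and the evaluation point is arbitrary:

library(mvtnorm)   # for dmvnorm()

mu = c(2, 1)
Sigma = matrix(c(1, 0, 0, 1.5), nrow = 2)
x = c(2.5, 0.5)                  # an arbitrary point to evaluate the density at
k = length(mu)

# density from the formula
exp(-0.5 * t(x - mu) %*% solve(Sigma) %*% (x - mu)) / sqrt((2 * pi)^k * det(Sigma))

# same value from mvtnorm
dmvnorm(x, mean = mu, sigma = Sigma)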

Example—consider the random vector \({\boldsymbol X} = (X_1, X_2)\), which is bivariate normal with mean vector \({\boldsymbol \mu }= \big[\begin{smallmatrix} 2 \\ 1 \end{smallmatrix} \big]\) and covariance matrix \({\boldsymbol \Sigma }= \big[ \begin{smallmatrix} 1 & 0 \\ 0 & 1.5 \end{smallmatrix} \big]\) (i.e. \(X_1\) and \(X_2\) are uncorrelated). Below is a plot showing the 3D bivariate normal for \(X_1\) and \(X_2\), overlaid on a heatmap:

Below is an example with correlated RVs—consider the random vector \({\boldsymbol X}\) with mean vector \({\boldsymbol \mu }= \big[\begin{smallmatrix} 2 \\ 1 \end{smallmatrix} \big]\) and covariance matrix \({\boldsymbol \Sigma }= \big[ \begin{smallmatrix} 1 & 0.9 \\ 0.9 & 1.5 \end{smallmatrix} \big]\):
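
A sketch of how draws from this correlated bivariate normal could be generated and checked with mvrnorm(), using an off-diagonal covariance of 0.9 to match the example above:

library(MASS)

set.seed(1)
mu = c(2, 1)
Sigma = matrix(c(1.0, 0.9,
                 0.9, 1.5), nrow = 2, byrow = TRUE)

X = mvrnorm(n = 10000, mu = mu, Sigma = Sigma)

colMeans(X)    # close to (2, 1)
cov(X)         # close to Sigma
cor(X)[1, 2]   # implied correlation: 0.9 / sqrt(1 * 1.5), roughly 0.73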