Collinearity

Co-linearity

If an experiment is designed incorrectly we may not be able to estimate the parameters of interest. Similarly, when analyzing data we may incorrectly decide to use a model that can’t be fit. If we are using linear models then we can detect these problems mathematically by looking for collinearity in the design matrix.

System of equations example

The following system of equations:

$\begin{align*} a+c &=1\\ b-c &=1\\ a+b &=2 \end{align*}$

has more than one solution, since there are an infinite number of triplets that satisfy $a=1-c, b=1+c$. Two examples are $a=1,b=1,c=0$ and $a=0,b=2,c=1$.

Matrix algebra approach

The system of equations above can be written like this:

$\, \begin{pmatrix} 1&0&1\\ 0&1&-1\\ 1&1&0\\ \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 1\\ 1\\ 2 \end{pmatrix}$

Note that the third column is a linear combination of the first two:

$\, \begin{pmatrix} 1\\ 0\\ 1 \end{pmatrix} - \begin{pmatrix} 0\\ 1\\ 1 \end{pmatrix} = \begin{pmatrix} 1\\ -1\\ 0 \end{pmatrix}$

We say that the third column is collinear with the first two. This implies that the system of equations can be written like this:

$\, \begin{pmatrix} 1&0&1\\ 0&1&-1\\ 1&1&0 \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix} = a \begin{pmatrix} 1\\ 0\\ 1 \end{pmatrix} + b \begin{pmatrix} 0\\ 1\\ 1 \end{pmatrix} + c \begin{pmatrix} 1-0\\ 0-1\\ 1-1 \end{pmatrix}$ $=(a+c) \begin{pmatrix} 1\\ 0\\ 1\\ \end{pmatrix} + (b-c) \begin{pmatrix} 0\\ 1\\ 1\\ \end{pmatrix}$

The third column does not add a constraint, and what we really have are three equations and two unknowns: $a+c$ and $b-c$. Once we have values for those two quantities, there are an infinite number of triplets $(a, b, c)$ that are consistent with them.
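The argument above can be checked numerically. The following is a minimal sketch; the names `M`, `rhs`, and `solves` are introduced here only for illustration. It confirms that the third column equals the first minus the second, that the matrix has rank 2 rather than 3, and that every triplet of the form $a=1-c, b=1+c$ solves the system for a few arbitrarily chosen values of $c$.

```r
# Coefficient matrix and right-hand side of the system of equations
M <- rbind(c(1, 0,  1),
           c(0, 1, -1),
           c(1, 1,  0))
rhs <- c(1, 1, 2)

# The third column is the first column minus the second
col_check <- all(M[, 3] == M[, 1] - M[, 2])

# Only two columns are linearly independent
rank_M <- qr(M)$rank

# For any value of c, the triplet (1 - c, 1 + c, c) solves the system
solves <- sapply(c(-2, 0, 1, 3.5), function(cc) {
  sol <- c(1 - cc, 1 + cc, cc)
  isTRUE(all.equal(drop(M %*% sol), rhs))
})

cat("third column = first - second:", col_check, "\n")
cat("rank =", rank_M, "\n")
cat("all candidate triplets solve the system:", all(solves), "\n")
```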

Collinearity and least squares

Consider a design matrix $\mathbf{X}$ with two collinear columns. Here we create an extreme example in which one column is the opposite of another:

$\mathbf{X} = \begin{pmatrix} \mathbf{1}&\mathbf{X}_1&\mathbf{X}_2&\mathbf{X}_3\\ \end{pmatrix} \mbox{ with, say, } \mathbf{X}_3 = - \mathbf{X}_2$

This means that we can rewrite the residuals like this:

$\mathbf{Y}- \left\{ \mathbf{1}\beta_0 + \mathbf{X}_1\beta_1 + \mathbf{X}_2\beta_2 + \mathbf{X}_3\beta_3\right\}\\ = \mathbf{Y}- \left\{ \mathbf{1}\beta_0 + \mathbf{X}_1\beta_1 + \mathbf{X}_2\beta_2 - \mathbf{X}_2\beta_3\right\}\\ = \mathbf{Y}- \left\{\mathbf{1}\beta_0 + \mathbf{X}_1 \beta_1 + \mathbf{X}_2(\beta_2 - \beta_3)\right\}$

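and if $\hat{\beta}_1$, $\hat{\beta}_2$, $\hat{\beta}_3$ is a least squares solution, then, for example, $\hat{\beta}_1$, $\hat{\beta}_2+1$, $\hat{\beta}_3+1$ is also a solution, because the residuals depend on $\beta_2$ and $\beta_3$ only through the difference $\beta_2 - \beta_3$.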
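To make the non-uniqueness concrete, here is a small sketch using simulated data. The sample size, the simulated columns, and the helper `rss` are all assumptions made for illustration and do not come from a real experiment. We obtain one least squares solution by fitting the model without the redundant column (which amounts to setting $\hat{\beta}_3=0$), then add 1 to both $\hat{\beta}_2$ and $\hat{\beta}_3$ and confirm that the residual sum of squares does not change.

```r
set.seed(1)
n  <- 20
X1 <- rnorm(n)
X2 <- rnorm(n)
X3 <- -X2                           # X3 is the opposite of X2
X  <- cbind(1, X1, X2, X3)          # design matrix with collinear columns
Y  <- 1 + 2 * X1 + 3 * X2 + rnorm(n)

# Residual sum of squares for a given coefficient vector
rss <- function(beta) sum((Y - X %*% beta)^2)

# One least squares solution: fit without X3, then set beta3 = 0
fit      <- lm.fit(cbind(1, X1, X2), Y)
beta_hat <- c(unname(fit$coefficients), 0)

# Shift beta2 and beta3 by the same amount; their difference is unchanged
beta_alt <- beta_hat + c(0, 0, 1, 1)

rss_hat <- rss(beta_hat)
rss_alt <- rss(beta_alt)
cat("RSS (original):", rss_hat, "\n")
cat("RSS (shifted): ", rss_alt, "\n")
```

The two residual sums of squares agree up to floating point error, so the data cannot distinguish between the two coefficient vectors.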

Confounding as an example

Now we will demonstrate how collinearity helps us determine problems with our design using one of the most common errors made in current experimental design: confounding. To illustrate, let’s use an imagined experiment in which we are interested in the effect of four treatments A, B, C and D. We assign two mice to each treatment. After starting the experiment by giving A and B to female mice, we realize there might be a sex effect. We decide to give C and D to males with hopes of estimating this effect. But can we estimate the sex effect? The described design implies the following design matrix:

$\, \begin{pmatrix} Sex & A & B & C & D\\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 1\\ \end{pmatrix}$

Here we can see that sex and treatment are confounded. Specifically, the sex column can be written as a linear combination of the C and D columns.

$\, \begin{pmatrix} Sex \\ 0\\ 0 \\ 0 \\ 0 \\ 1\\ 1\\ 1 \\ 1 \\ \end{pmatrix} = \begin{pmatrix} C \\ 0\\ 0\\ 0\\ 0\\ 1\\ 1\\ 0\\ 0\\ \end{pmatrix} + \begin{pmatrix} D \\ 0\\ 0\\ 0\\ 0\\ 0\\ 0\\ 1\\ 1\\ \end{pmatrix}$

This implies that a unique least squares estimate is not achievable.

Rank

The rank of a matrix is the number of its columns that are linearly independent. If the rank is smaller than the number of columns, then the least squares estimates (LSE) are not unique. In R, we can obtain the rank of a matrix with the function qr, which we will describe in more detail in a following section.

Sex <- c(0,0,0,0,1,1,1,1)
A <-   c(1,1,0,0,0,0,0,0)
B <-   c(0,0,1,1,0,0,0,0)
C <-   c(0,0,0,0,1,1,0,0)
D <-   c(0,0,0,0,0,0,1,1)
X <- model.matrix(~Sex+A+B+C+D-1)
cat("ncol=",ncol(X),"rank=", qr(X)$rank,"\n")

## ncol= 5 rank= 4

Here we will not be able to estimate the effect of sex.
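As a sketch of what this looks like in practice, we can try to fit the confounded design. The outcome `Y` below is simulated noise and is purely an assumption for illustration. Two symptoms appear: the normal equations cannot be solved because $\mathbf{X}^\top\mathbf{X}$ is singular, and `lm` reports a rank of 4 and returns `NA` for one of the five coefficients.

```r
Sex <- c(0,0,0,0,1,1,1,1)
A <-   c(1,1,0,0,0,0,0,0)
B <-   c(0,0,1,1,0,0,0,0)
C <-   c(0,0,0,0,1,1,0,0)
D <-   c(0,0,0,0,0,0,1,1)
X <- model.matrix(~Sex+A+B+C+D-1)

# The Sex column is exactly the sum of the C and D columns
sex_is_sum <- all(X[, "Sex"] == X[, "C"] + X[, "D"])

# Solving the normal equations fails because t(X) %*% X is singular
xtx_inv <- tryCatch(solve(crossprod(X)), error = function(e) NULL)

# Hypothetical outcome, simulated only to have something to fit
set.seed(1)
Y <- rnorm(8)
fit  <- lm(Y ~ Sex + A + B + C + D - 1)
n_na <- sum(is.na(coef(fit)))

cat("Sex = C + D:", sex_is_sum, "\n")
cat("crossprod(X) invertible:", !is.null(xtx_inv), "\n")
cat("lm rank =", fit$rank, " NA coefficients =", n_na, "\n")
```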

Removing Confounding

This particular experiment could have been designed better. Using the same number of male and female mice, we can easily design an experiment that allows us to compute the sex effect as well as all the treatment effects. Specifically, when we balance sex and treatments, the confounding is removed as demonstrated by the fact that the rank is now the same as the number of columns:

Sex <- c(0,1,0,1,0,1,0,1)
A <-   c(1,1,0,0,0,0,0,0)
B <-   c(0,0,1,1,0,0,0,0)
C <-   c(0,0,0,0,1,1,0,0)
D <-   c(0,0,0,0,0,0,1,1)
X <- model.matrix(~Sex+A+B+C+D-1)
cat("ncol=",ncol(X),"rank=", qr(X)$rank,"\n")

## ncol= 5 rank= 5
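With full column rank, the normal equations have a unique solution. The sketch below illustrates this under stated assumptions: the values in `beta_true` are hypothetical, and the outcome is generated without noise so that the recovered coefficients can be compared directly with the values used to generate it.

```r
Sex <- c(0,1,0,1,0,1,0,1)
A <-   c(1,1,0,0,0,0,0,0)
B <-   c(0,0,1,1,0,0,0,0)
C <-   c(0,0,0,0,1,1,0,0)
D <-   c(0,0,0,0,0,0,1,1)
X <- model.matrix(~Sex+A+B+C+D-1)

# Hypothetical effects, chosen only for illustration
beta_true <- c(Sex = 2, A = 1, B = -1, C = 0.5, D = 3)

# Noise-free outcome so the solution can be compared with beta_true
Y <- drop(X %*% beta_true)

# Solve the normal equations (X'X) beta = X'Y; this works because X has rank 5
beta_hat <- drop(solve(crossprod(X), crossprod(X, Y)))

print(round(beta_hat, 6))
```

All five coefficients, including the sex effect, are recovered, which is not possible with the confounded design.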