Consider these design matrices:

  1. Which of the above design matrices does NOT have the problem of collinearity?

  2. The following exercises are advanced. Let’s use the example from the lecture to visualize how there is not a single best $\hat{\beta}$ when the design matrix has collinear columns. An example can be made with:

     sex <- factor(rep(c("female","male"),each=4))
     trt <- factor(c("A","A","B","B","C","C","D","D"))

    The model matrix can then be formed with:

     X <- model.matrix( ~ sex + trt)

    And we can see that the number of independent columns is less than the number of columns of X:
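
    One way to check this is sketched below. It assumes base R's `qr` function for the rank; the comparison with `ncol(X)` and the variable names `rankX` and `ncolX` are added here for illustration:

    ```r
    # Rebuild the design matrix from the exercise
    sex <- factor(rep(c("female","male"),each=4))
    trt <- factor(c("A","A","B","B","C","C","D","D"))
    X <- model.matrix( ~ sex + trt)

    # Number of linearly independent columns, from the QR decomposition
    rankX <- qr(X)$rank
    # Total number of columns of X
    ncolX <- ncol(X)
    c(rank = rankX, columns = ncolX)
    ```

    The rank falls short of the column count because the male column equals the sum of the C and D columns.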


    Suppose we observe some outcome Y. For simplicity, we will use synthetic data:

     Y <- 1:8

    Now we will fix the values of two coefficients and optimize the remaining ones. We will fix $\beta_{male}$ and $\beta_D$. Then we will find the optimal values for the remaining betas, in terms of minimizing the residual sum of squares. We find the values that minimize:

    $$\sum_{i=1}^8 \left\{ (Y_i - X_{i,male} \beta_{male} - X_{i,D} \beta_D) - \mathbf{Z}_i \boldsymbol{\gamma} \right\}^2$$

    where $X_{male}$ is the male column of the design matrix, $X_D$ is the D column, $\mathbf{Z}_i$ is a 1 by 3 matrix with the remaining column entries for unit $i$, and $\boldsymbol{\gamma}$ is a 3 x 1 matrix with the remaining parameters.

    So all we need to do is redefine $Y$ as $Y^* = Y - X_{male} \beta_{male} - X_D \beta_D$ and fit a linear model. The following line of code creates this variable $Y^*$, after fixing $\beta_{male}$ to a value a and $\beta_D$ to a value b:

     makeYstar <- function(a,b) Y - X[,2] * a - X[,5] * b

    Now we’ll construct a function which, for given values a and b, gives us back the sum of squared residuals after fitting the other terms.

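     fitTheRest <- function(a,b) {
       Ystar <- makeYstar(a,b)
       # drop the two columns whose coefficients are fixed (male and D)
       Xrest <- X[,-c(2,5)]
       # least squares solution for the remaining coefficients
       betarest <- solve(t(Xrest) %*% Xrest) %*% t(Xrest) %*% Ystar
       residuals <- Ystar - Xrest %*% betarest
       sum(residuals^2)
     }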

    What is the sum of squared residuals when the male coefficient is 1 and the D coefficient is 2, and the other coefficients are fit using the linear model solution?

  3. We can apply our function fitTheRest to a grid of values for $\beta_{male}$ and $\beta_D$, using the outer function in R. outer takes three arguments: a grid of values for the first argument, a grid of values for the second argument, and finally a function which takes two arguments.

    Try it out:
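
    A small illustration is a multiplication table. The choice of `*` and of the 1:3 grids is an assumption made here, since any vectorized two-argument function would do, and `tab` is a name introduced for the example:

    ```r
    # outer applies the function to every pair (x[i], y[j])
    # and returns the results as a length(x) by length(y) matrix
    tab <- outer(1:3, 1:3, `*`)
    tab
    ```

    Entry [i, j] of the result is the function evaluated at the i-th value of the first grid and the j-th value of the second.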


    We can run fitTheRest on a grid of values, using the following code (Vectorize is necessary because outer requires a vectorized function):
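
    A self-contained sketch of such a call follows. The grid `-2:8` and the names `grid` and `ssr` are assumptions made for illustration; substitute whichever grid the exercise intends:

    ```r
    # Setup repeated from the previous exercise so the block runs on its own
    sex <- factor(rep(c("female","male"),each=4))
    trt <- factor(c("A","A","B","B","C","C","D","D"))
    X <- model.matrix( ~ sex + trt)
    Y <- 1:8
    makeYstar <- function(a,b) Y - X[,2] * a - X[,5] * b
    fitTheRest <- function(a,b) {
      Ystar <- makeYstar(a,b)
      Xrest <- X[,-c(2,5)]
      betarest <- solve(t(Xrest) %*% Xrest) %*% t(Xrest) %*% Ystar
      residuals <- Ystar - Xrest %*% betarest
      sum(residuals^2)
    }

    # Assumed grid of candidate values for both fixed coefficients
    grid <- -2:8
    # Vectorize wraps fitTheRest so it accepts the vectors that outer passes in
    ssr <- outer(grid, grid, Vectorize(fitTheRest))
    dim(ssr)
    ```

    Rows of `ssr` index the male coefficient and columns index the D coefficient; `min(ssr)` then gives the smallest sum of squared residuals on this grid.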


    In the grid of values, what is the smallest sum of squared residuals?