For these exercises we are again going to use:
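The setup code is missing from this excerpt. Based on the dimensions cited later (a 22,215 by 189 expression matrix `e`), we assume the `tissuesGeneExpression` dataset from the course's GitHub packages; treat the package name here as an assumption:

```r
## Assumed setup (not stated in this excerpt): the tissuesGeneExpression
## package defines the gene expression matrix `e` used throughout.
# devtools::install_github("genomicsclass/tissuesGeneExpression")
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e)
```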
Before we start these exercises, it is important to reemphasize that, in practice, the solution to the SVD is not unique. This is because $\mathbf{UDV}^\top = (-\mathbf{U})\mathbf{D}(-\mathbf{V})^\top$. In fact, we can flip the sign of each column of $\mathbf{U}$ and, as long as we also flip the respective column in $\mathbf{V}$, we will arrive at the same solution. Here is an example:
```r
s = svd(e)
signflips = sample(c(-1,1),ncol(e),replace=TRUE)
signflips
```
Now we switch the sign of each column and check that we get the same answer. We do this using the function `sweep`: if `x` is a matrix and `a` is a vector, then `sweep(x,1,a,FUN="*")` applies the function `FUN` to each row `i`, in this case computing `x[i,]*a[i]`. If instead of 1 we use 2, `sweep` applies this to columns. To learn more about `sweep`, read `?sweep`.
```r
newu = sweep(s$u,2,signflips,FUN="*")
newv = sweep(s$v,2,signflips,FUN="*")
identical( s$u %*% diag(s$d) %*% t(s$v), newu %*% diag(s$d) %*% t(newv))
```
This is important to know because different implementations of the SVD algorithm may give different signs, which can lead to the same code resulting in different answers when run in different computer systems.
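The sign-flip invariance can be checked on any matrix. Here is a self-contained sketch using a small random matrix of our own (not the expression data), flipping matching columns of $\mathbf{U}$ and $\mathbf{V}$ and confirming both factorizations reconstruct the original matrix:

```r
set.seed(1)
A <- matrix(rnorm(20), 5, 4)
s <- svd(A)
flips <- sample(c(-1, 1), ncol(A), replace = TRUE)
## flip the sign of each column of U and the respective column of V
newu <- sweep(s$u, 2, flips, FUN = "*")
newv <- sweep(s$v, 2, flips, FUN = "*")
## both reconstructions match A up to numerical error
max(abs(A - s$u %*% diag(s$d) %*% t(s$v)))
max(abs(A - newu %*% diag(s$d) %*% t(newv)))
```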
1. Compute the SVD of `e`:

```r
s = svd(e)
```

Now compute the mean of each row:

```r
m = rowMeans(e)
```

What is the correlation between the first column of `s$u` and `m`?
2. In exercise 1, we saw how the first column of $\mathbf{U}$ relates to the mean of the rows of `e`. If we change these means, the distances between columns do not change:

```r
newmeans = rnorm(nrow(e)) ## random values we will add to create new means
newe = e + newmeans ## we change the means
sqrt(crossprod(e[,3]-e[,45]))
sqrt(crossprod(newe[,3]-newe[,45]))
```

So we might as well make the mean of each row 0, since it does not help us approximate the column distances. We will define `y` as the detrended `e` and recompute the SVD:

```r
y = e - rowMeans(e)
s = svd(y)
```
We showed that $\mathbf{UDV}^\top$ is equal to `y` up to numerical error:

```r
resid = y - s$u %*% diag(s$d) %*% t(s$v)
max(abs(resid))
```
The above can be made more efficient in two ways: first, by using `crossprod` and, second, by not creating a diagonal matrix. In R, we can multiply a matrix `x` by a vector `a` directly with `x*a`. The result is a matrix whose row `i` is `x[i,]*a[i]`, which is equivalent to `sweep(x,1,a,"*")`. This means that we don't have to convert `s$d` into a diagonal matrix to compute $\mathbf{DV}^\top$.
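The example that originally illustrated this point is missing from the excerpt; here is a small illustration of our own showing that multiplying a matrix by a vector scales each row, and that this matches `sweep`:

```r
x <- matrix(rep(c(1, 2), each = 5), 5, 2)  # a column of 1s and a column of 2s
a <- 1:5
## recycling multiplies row i of x by a[i]
x * a
## equivalent to sweeping the vector across rows
identical(x * a, sweep(x, 1, a, FUN = "*"))
```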
Which of the following gives us the same as
s$d %*% t(s$v)
s$d * t(s$v)
t(s$d * s$v)
s$v * s$d
3. If we define `vd = t(s$d * t(s$v))`, then which of the following is not the same as $\mathbf{UDV}^\top$?

- `s$u %*% s$d * t(s$v)`
- `s$u %*% (s$d * t(s$v))`
- `tcrossprod( t( s$d*t(s$u)), s$v)`
4. Define `z = s$d * t(s$v)`. We showed a derivation demonstrating that, because $\mathbf{U}$ is orthogonal, the distance between `e[,3]` and `e[,45]` is the same as the distance between `y[,3]` and `y[,45]`, which is the same as the distance between `z[,3]` and `z[,45]`:

```r
z = s$d * t(s$v) ## s was computed from the SVD of y above
sqrt(crossprod(e[,3]-e[,45]))
sqrt(crossprod(y[,3]-y[,45]))
sqrt(crossprod(z[,3]-z[,45]))
```
Note that the columns of `z` have 189 entries, compared to 22,215 for `e`.

What is the difference, in absolute value, between the actual distance, `sqrt(crossprod(e[,3]-e[,45]))`, and the approximation using only the first two dimensions of `z`?
5. How many dimensions do we need to use for the approximation in exercise 4 to be within 10%?
6. Compute the distances between sample 3 and all other samples. Then recompute these distances using the two-dimensional approximation. What is the Spearman correlation between the approximate distances and the actual distances?
The last exercise shows how just two dimensions can be useful to get a rough idea about the actual distances.
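To illustrate this point on self-contained data (a toy matrix of our own, not the expression set), the following sketch compares actual column distances to distances computed from only the first two SVD dimensions; because $\mathbf{U}$ is orthogonal, the two-dimensional distances can never exceed the actual ones:

```r
set.seed(2)
## toy data: one strong direction of variation plus noise
m <- matrix(rnorm(100 * 20), 100, 20) + outer(rnorm(100), rnorm(20, sd = 4))
y <- m - rowMeans(m)
s <- svd(y)
z <- s$d * t(s$v)  # same column distances as y, but only 20 entries per column
## distances from column 1 to all other columns
actual <- sqrt(apply(y[, -1] - y[, 1], 2, crossprod))
approx2 <- sqrt(apply(z[1:2, -1] - z[1:2, 1], 2, crossprod))
cor(actual, approx2, method = "spearman")
```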