SVD exercises
{pagebreak}
Exercises
For these exercises we are again going to use:
```r
library(tissuesGeneExpression)
data(tissuesGeneExpression)
```
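As before, this loads the expression matrix `e` (genes in rows, samples in columns) and the vector of tissue labels `tissue`. A quick sanity check of the loaded objects might look like this (a minimal sketch; the exact output is not needed for the exercises):

```r
## Inspect the loaded objects: dimensions of the expression matrix
## and the number of samples per tissue type.
dim(e)
length(tissue)
table(tissue)
```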
Before we start these exercises, it is important to re-emphasize that in practice the solution to the SVD is not unique. This is because $\mathbf{U}\mathbf{D}\mathbf{V}^\top = (-\mathbf{U})\,\mathbf{D}\,(-\mathbf{V})^\top$. In fact, we can flip the sign of each column of $\mathbf{U}$ and, as long as we also flip the respective column of $\mathbf{V}$, we will arrive at the same solution. Here is an example:
```r
s = svd(e)
signflips = sample(c(-1,1),ncol(e),replace=TRUE)
signflips
```
Now we switch the sign of each column and check that we get the same answer. We do this using the function `sweep`. If `x` is a matrix and `a` is a vector, then `sweep(x,1,a,FUN="*")` applies the function `FUN` to each row `i`, that is `FUN(x[i,],a[i])`, in this case `x[i,]*a[i]`. If instead of 1 we use 2, `sweep` applies this to the columns. To learn more about `sweep`, read `?sweep`.
```r
newu = sweep(s$u,2,signflips,FUN="*")
newv = sweep(s$v,2,signflips,FUN="*")
identical(s$u %*% diag(s$d) %*% t(s$v), newu %*% diag(s$d) %*% t(newv))
```
This is important to know because different implementations of the SVD algorithm may give different signs, which can lead to the same code producing different answers when run on different computer systems.
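If you want reproducible signs, one possible convention, sketched below rather than prescribed, is to force each column of $\mathbf{U}$ to have a non-negative sum and flip the corresponding column of $\mathbf{V}$:

```r
## One possible sign convention: make each column of U sum to a
## non-negative number, flipping the matching column of V as well.
signs = ifelse(colSums(s$u) < 0, -1, 1)
u_fixed = sweep(s$u, 2, signs, FUN="*")
v_fixed = sweep(s$v, 2, signs, FUN="*")
## The reconstruction is unchanged:
max(abs(s$u %*% diag(s$d) %*% t(s$v) - u_fixed %*% diag(s$d) %*% t(v_fixed)))
```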

1. Compute the SVD of `e`:

    ```r
    s = svd(e)
    ```

    Now compute the mean of each row:

    ```r
    m = rowMeans(e)
    ```

    What is the correlation between the first column of $\mathbf{U}$ and `m`?


2. In exercise 1, we saw how the first column of $\mathbf{U}$ relates to the mean of the rows of `e`. If we change these means, the distances between columns do not change. For example, changing the means does not change the distances:

    ```r
    newmeans = rnorm(nrow(e)) ## random values we will add to create new means
    newe = e + newmeans ## we change the means
    sqrt(crossprod(e[,3]-e[,45]))
    sqrt(crossprod(newe[,3]-newe[,45]))
    ```

    So we might as well make the mean of each row 0, since it does not help us approximate the column distances. We will define `y` as the detrended `e` and recompute the SVD:

    ```r
    y = e - rowMeans(e)
    s = svd(y)
    ```

    We showed that $\mathbf{U}\mathbf{D}\mathbf{V}^\top$ is equal to `y` up to numerical error:

    ```r
    resid = y - s$u %*% diag(s$d) %*% t(s$v)
    max(abs(resid))
    ```

    The above can be made more efficient in two ways: first, by using `crossprod` and, second, by not creating a diagonal matrix. In R, we can multiply a matrix `x` by a vector `a`; the result is a matrix with row `i` equal to `x[i,]*a[i]`. Run the following example to see this:

    ```r
    x = matrix(rep(c(1,2),each=5),5,2)
    x*c(1:5)
    ```

    which is equivalent to:

    ```r
    sweep(x,1,1:5,"*")
    ```

    This means that we don't have to convert `s$d` into a matrix. Which of the following gives us the same as `diag(s$d) %*% t(s$v)`? (A numeric way to check a candidate is sketched after the options.)

    - A) `s$d %*% t(s$v)`
    - B) `s$d * t(s$v)`
    - C) `t(s$d * s$v)`
    - D) `s$v * s$d`
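
    As a minimal sketch of how to check a candidate numerically, reusing the small matrix `x` defined just above, the comparison pattern is:

    ```r
    ## Verify the sweep equivalence stated above with a numeric comparison;
    ## the same pattern can be applied to each of the options:
    ## all.equal(diag(s$d) %*% t(s$v), <candidate>)
    all.equal(x * c(1:5), sweep(x, 1, 1:5, "*"))
    ```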

3. If we define `vd = t(s$d * t(s$v))`, then which of the following is not the same as $\mathbf{U}\mathbf{D}\mathbf{V}^\top$?

    - A) `tcrossprod(s$u,vd)`
    - B) `s$u %*% s$d * t(s$v)`
    - C) `s$u %*% (s$d * t(s$v))`
    - D) `tcrossprod(t(s$d*t(s$u)), s$v)`

4. Let `z = s$d * t(s$v)`. We showed a derivation demonstrating that, because $\mathbf{U}$ is orthogonal, the distance between `e[,3]` and `e[,45]` is the same as the distance between `y[,3]` and `y[,45]`, which is the same as the distance between `z[,3]` and `z[,45]`:

    ```r
    z = s$d * t(s$v) ## s$d and s$v come from the SVD of y computed above
    sqrt(crossprod(e[,3]-e[,45]))
    sqrt(crossprod(y[,3]-y[,45]))
    sqrt(crossprod(z[,3]-z[,45]))
    ```

    Note that the columns of `z` have 189 entries, compared to 22,215 for `e`.

    What is the difference, in absolute value, between the actual distance

    ```r
    sqrt(crossprod(e[,3]-e[,45]))
    ```

    and the approximation using only two dimensions of `z`? (One way to compute the two-dimensional approximation is sketched below.)
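
    As a minimal sketch of what "using only two dimensions" can mean here, keep only the first two rows of `z` when computing the distance; using more rows in place of `1:2` generalizes this to exercise 5:

    ```r
    ## Approximate the distance between samples 3 and 45 using only the
    ## first two rows (dimensions) of z, then compare to the actual distance.
    approx2 = sqrt(crossprod(z[1:2,3] - z[1:2,45]))
    actual = sqrt(crossprod(e[,3] - e[,45]))
    abs(actual - approx2)
    ```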

5. How many dimensions do we need to use for the approximation in exercise 4 to be within 10% of the actual distance?

6. Compute the distances between sample 3 and all other samples.

7. Recompute these distances using the two-dimensional approximation. What is the Spearman correlation between the approximate distances and the actual distances?

The last exercise shows how just two dimensions can be useful to get a rough idea about the actual distances.
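For instance, a sketch along the following lines (assuming `e` and `z` are defined as above) compares the actual distances from sample 3 with their two-dimensional approximation:

```r
## Distances from sample 3 to every other sample: actual versus the
## approximation based on the first two rows (dimensions) of z.
actual_d = sqrt(apply(e[, -3] - e[, 3], 2, crossprod))
approx_d = sqrt(apply(z[1:2, -3] - z[1:2, 3], 2, crossprod))
plot(actual_d, approx_d)
cor(actual_d, approx_d, method = "spearman")
```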